
Lecture 07: Flow control
0.1.0 About Introduction to R
Introduction to R is brought to you by the Centre for the Analysis of Genome Evolution & Function (CAGEF) bioinformatics training initiative. This course was developed based on feedback on the needs and interests of the Department of Cell & Systems Biology and the Department of Ecology and Evolutionary Biology.
The structure of this course is a code-along style: it is 100% hands-on! A few hours prior to each lecture, links to the materials will be available for download at Quercus. The teaching materials will consist of a Jupyter Notebook with concepts, comments, instructions, and blank coding spaces that you will fill out with R by coding along with the instructor. Other teaching materials include a live-updating HTML version of the notebook, and datasets to import into R when required. This learning approach will allow you to spend the time coding and not taking notes!
As we go along, there will be some in-class challenge questions for you to solve either individually or in cooperation with your peers. Post lecture assessments will also be available (see syllabus for grading scheme and percentages of the final mark) through DataCamp to help cement and/or extend what you learn each week.
0.1.1 Where is this course headed?
We'll take a blank slate approach here to R and assume that you pretty much know nothing about programming. From the beginning of this course to the end, we want to take you from some potential scenarios such as...
A pile of data (like an excel file or tab-separated file) full of experimental observations and you don't know what to do with it.
Maybe you're manipulating large tables all in excel, making custom formulas and pivot tables with graphs. Now you have to repeat similar experiments and do the analysis again.
You're generating high-throughput data and there aren't any bioinformaticians around to help you sort it out.
You heard about R and what it could do for your data analysis but don't know what that means or where to start.
and get you to a point where you can...
Format your data correctly for analysis
Produce basic plots and perform exploratory analysis
Make functions and scripts for re-analysing existing or new data sets
Track your experiments in a digital notebook like Jupyter!

0.1.2 How do we get there? Step-by-step.
In the first lesson, we will talk about the basic data structures and objects in R, get cozy with the Jupyter Notebook environment, and learn how to get help when you are stuck because everyone gets stuck - a lot! Then you will learn how to get your data in and out of R, how to tidy your data (data wrangling), and then subset and merge data. After that, we will dig into the data and learn how to make basic plots for both exploratory data analysis and publication. We'll follow that up with data cleaning and string manipulation; this is really the battleground of coding - getting your data into just the right format where you can analyse it more easily. We'll then spend a lecture digging into the functions available for the statistical analysis of your data. Lastly, we will learn about control flow and how to write customized functions, which can really save you time and help scale up your analyses.

Don't forget, the structure of the class is a code-along style: it is fully hands on. At the end of each lecture, the complete notes will be made available in a PDF format through the corresponding Quercus module so you don't have to spend your attention on taking notes.
0.1.3 What kind of coding style will we learn?
There is no single correct path from A to B - although some paths may be more elegant, or more efficient than others. With that in mind, the emphasis in this lecture series will be on:
- Code simplicity - learn helpful functions that allow you to focus on understanding the basic tenets of good data wrangling (reformatting) to facilitate quick exploratory data analysis and visualization.
- Code readability - format and comment your code for yourself and others so that even those with minimal experience in R will be able to quickly grasp the overall steps in your code.
- Code stability - while the core R code is relatively stable, behaviours of functions can still change with updates. There are well-developed packages we'll focus on for our analyses. Namely, we'll become more familiar with the tidyverse series of packages. This resource is well-maintained by a large community of developers. While not always the "fastest" approach, this additional layer can help ensure your code still runs (somewhat) smoothly later down the road.
0.2.0 Class Objectives
This is the final in a series of seven lectures. Last lecture we explored the realm of statistical analyses with linear regression and other general linear models. Now we arrive at the final destination, addressing how to create looping and branching code, as well as our own functions in the topic of control flow. At the end of this session we will have covered:
- Control flow statements.
- Combining control flow with useful functions.
- Build your own function in R.
- Saving data and your workspace.

0.3.0 A legend for text format in Jupyter markdown
- Grey background: Command-line code, R library and function names
- Bold italics: Emphasis for important ideas and concepts
- Bold: Headers and subheaders
- Blue text: Named or unnamed hyperlinks
...fill in the code here if you are coding along
0.4.0 Lecture and data files used in this course
0.4.1 Weekly Lecture and skeleton files
Each week, new lesson files will appear within your JupyterHub folders. We are pulling from a GitHub repository using this Repository git-pull link. Simply click on the link and it will take you to the University of Toronto JupyterHub. You will need to use your UTORid credentials to complete the login process. From there you will find each week's lecture files in the directory /2023-09-IntroR/Lecture_XX. You will find a partially coded skeleton.ipynb file as well as all of the data files necessary to run the week's lecture.
Alternatively, you can download the Jupyter Notebook (.ipynb) and data files from JupyterHub to your personal computer if you would like to run independently of the JupyterHub.
0.4.2 Live-coding HTML page
A live lecture version will be available at camok.github.io that will update as the lecture progresses. Be sure to refresh to take a look if you get lost!
0.4.3 Post-lecture PDFs and Recordings
As mentioned above, at the end of each lecture there will be a completed version of the lecture code released as a PDF file under the Modules section of Quercus.
0.4.4 Microsporidia infection data set description
The following datasets used in this week's class come from a manuscript published in PLoS Pathogens entitled "High-throughput phenotyping of infection by diverse microsporidia species reveals a wild C. elegans strain with opposing resistance and susceptibility traits" by Mok et al., 2023. These datasets focus on an analysis of infection in wild isolate strains of the nematode C. elegans by environmental pathogens known as microsporidia. The authors collected embryo counts from individual animals in the population after population-wide infection by microsporidia and we'll spend our next few classes working with the dataset to learn how to format and manipulate it.
0.4.4.1 Dataset 1: embryo_data_long_merged.csv
It's the last time we'll be working with this dataset that we carefully created. It will help us work through the different aspects of control flow.
0.4.4.2 Source file: lecture07.R
We'll be using this source file later to show how you can save your own functions and import them for data analysis.
0.5.0 Packages used in this lesson
The following packages are used in this lesson:
- tidyverse (tidyverse installs several packages for you, like dplyr, readr, readxl, tibble, and ggplot2). In particular we will be taking advantage of the stringr package this week.
- viridis, our colour-blind friendly package for providing specific colour palettes to our visualizations
Some of these packages should already be installed into your Anaconda base from previous lectures. If not, please review that lesson and load these packages. Remember to please install these packages from the conda-forge channel of Anaconda.
conda install -c conda-forge r-biocmanager
BiocManager::install("limma")
conda install -c conda-forge r-gee
conda install -c conda-forge r-multcomp
#--------- Install packages for today's session ----------#
# install.packages("tidyverse", dependencies = TRUE) # This package should already be installed on Jupyter Hub
#--------- Load packages for today's session ----------#
library(tidyverse)
library(viridis)
-- Attaching core tidyverse packages ---------------------------------------------------------------- tidyverse 2.0.0 -- v dplyr 1.1.0 v readr 2.1.4 v forcats 1.0.0 v stringr 1.5.0 v ggplot2 3.4.3 v tibble 3.2.1 v lubridate 1.9.2 v tidyr 1.3.0 v purrr 1.0.2 -- Conflicts ---------------------------------------------------------------------------------- tidyverse_conflicts() -- x dplyr::filter() masks stats::filter() x dplyr::lag() masks stats::lag() i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors Loading required package: viridisLite
1.0.0 Control flow moves you beyond linear programming
[Figure: Don't repeat code when you can use flow control!]
Although we have only briefly touched on some of the aspects regarding control flow, it has been implemented behind the scenes in many of the functions you've used throughout this course. From your experience in Jupyter Notebooks, the order in which a code cell's individual statements or instructions are executed can be considered part of control flow. Expanding on this idea, when you see the number order of the code cells, this also indicates the control flow of the entire notebook or program. Once a code cell is run, the objects it has generated remain stored in memory and available for access.
Within our code cells and overall program, control flow can involve statements that make choices, build loops, evaluate conditions, and move us through the program. These specific statements allow us to run different blocks of code at different times. This can be accomplished through
- for loops
- cycling through values
- if statements
- while and repeat loops
- next and break
In this lecture, we'll touch on all of these concepts to give you a taste of how you can make your programs accomplish more with less actual code. Let's start by loading up an example dataset to play around with.
# check the working directory and the files available in ./data
getwd()
list.files("./data")
# read our file in with read_csv()
embryos.df <- read_csv("data/embryo_data_long_merged.csv", col_types = "cdffffdfllddfddffff")
# explore our loaded data frame
head(embryos.df)
- '190423_boxplot.facet.png'
- '190423_boxplot.facet.saveFunction.png'
- '190423_boxplot.png'
- '200704_boxplot.facet.makeFunction.png'
- '200704_boxplot.facet.png'
- '200704_boxplot.facet.saveFunction.png'
- '200704_boxplot.png'
- '200704_graph.facet.saveFunction.tryCatchv2.png'
- '200704_graph.facet.saveFunction2.png'
- '200711_boxplot.facet.makeFunction.png'
- '200711_boxplot.facet.png'
- '200711_boxplot.png'
- '200718_boxplot.facet.makeFunction.png'
- '200718_boxplot.facet.png'
- '200718_boxplot.png'
- '200818_boxplot.facet.png'
- '200818_boxplot.png'
- '200822_boxplot.facet.png'
- '200822_boxplot.png'
- '200901_boxplot.facet.png'
- '200901_boxplot.png'
- '200912_boxplot.facet.png'
- '200912_boxplot.png'
- '200915_boxplot.facet.png'
- '200915_boxplot.png'
- '221019_graph.facet.saveFunction.tryCatchv2.png'
- 'embryo_data_long_merged.csv'
- 'Lecture07.all.RData'
- 'Lecture07.R'
- 'Lecture07.Rdata'
- 'old'
| experiment | wormNumber | infectionDate | wormStrain | sporeStrain | sporeDose | sporesM_cm2 | doseLevel | spores | meronts | embryos | expTimepoint | infectionType | totalWorms | plateSize | fixingDate | stainingDate | slideDate | imagingDate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <chr> | <dbl> | <fct> | <fct> | <fct> | <fct> | <dbl> | <fct> | <lgl> | <lgl> | <dbl> | <dbl> | <fct> | <dbl> | <dbl> | <fct> | <fct> | <fct> | <fct> |
| 200707_N2_LUAm1_0M_72hpi | 1 | 200704 | N2 | LUAm1 | 0 | 0.000000 | Mock | FALSE | FALSE | 18 | 72 | continuous | 1000 | 6 | 200707 | 200803 | 200804 | 200806 |
| 200707_N2_LUAm1_10M_72hpi | 1 | 200704 | N2 | LUAm1 | 10 | 0.353857 | Medium | FALSE | TRUE | 7 | 72 | continuous | 1000 | 6 | 200707 | 200803 | 200804 | 200806 |
| 200707_JU1400_LUAm1_0M_72hpi | 1 | 200704 | JU1400 | LUAm1 | 0 | 0.000000 | Mock | FALSE | FALSE | 10 | 72 | continuous | 1000 | 6 | 200707 | 200803 | 200804 | 200806 |
| 200707_JU1400_LUAm1_10M_72hpi | 1 | 200704 | JU1400 | LUAm1 | 10 | 0.353857 | Medium | FALSE | TRUE | 0 | 72 | continuous | 1000 | 6 | 200707 | 200803 | 200804 | 200806 |
| 200707_ED3052A_LUAm1_0M_72hpi | 1 | 200704 | ED3052A | LUAm1 | 0 | 0.000000 | Mock | FALSE | FALSE | 12 | 72 | continuous | 1000 | 6 | 200707 | 200803 | 200804 | 200806 |
| 200707_ED3052A_LUAm1_10M_72hpi | 1 | 200704 | ED3052A | LUAm1 | 10 | 0.353857 | Medium | FALSE | TRUE | 0 | 72 | continuous | 1000 | 6 | 200707 | 200803 | 200804 | 200806 |
1.1.0 Use for() loops to repeat commands for a maximum number of iterations
R doesn't care if you write the same code 1000 times or have the interpreter repeat a single copy 1000 times. However, the second is a lot easier for you. The for() loop helps to reduce code replication by compartmentalizing a set of instructions to repeat instead of copying and pasting the same code several times.
More specifically, a for() loop executes a statement repetitively until a well-defined endpoint. That endpoint is reached once the loop variable has taken every value in a given sequence.
For example, let's say that we want to add 2 to a several times in a row, overwriting a every time:
# Increment a by 2, the bad way...
a <- 2
a <- a+2
a <- a+2
a <- a+2
a <- a+2
a <- a+2
a <- a+2
a <- a+2
a <- a+2
a <- a+2
a
Sure, 10 times is doable by hand, just copy-paste. But what if you need to perform that same task, say 1,000 times? What if the code was more complex than a <- a + 2? That is when for() loops come to the rescue.
# Increment 'anything' using a for loop
anything <- 10
# Set up your for loop with a 'tally' count
for (tally in 1:1000) {
  anything <- anything + 2
}
tally
anything
1.1.1 The for loop can be described in three stages
- for(x in y): Set a variable x to equal the next value in a given sequence y
- { code to run }: Run a set of code which can use the variable x at its assigned value in the cycle
- Repeat (hence the loop)
There are a number of ways to set the counting variable within the for() initialization. In reality, you just need to supply a vector of elements for it to iterate through. This could be a sequence where y is defined as a:b, or a numeric vector, or even a vector of objects! Each of these is assigned to x in our loop and must be used appropriately.
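As a minimal sketch of the options just described, the loops below iterate over a sequence, a numeric vector, and a list of objects. The variable names (total, doses, items, lengths_seen) are made up for illustration and are not part of the lecture data.

```r
# 1. A sequence a:b
total <- 0
for (x in 1:5) {
  total <- total + x
}
total            # 15

# 2. A numeric vector
doses <- c(0, 3.5, 10)
for (dose in doses) {
  cat("Dose:", dose, "\n")
}

# 3. A list of objects: each element can be a different type and length
items <- list(1:3, c("a", "b"), TRUE)
lengths_seen <- integer(0)
for (item in items) {
  lengths_seen <- c(lengths_seen, length(item))
}
lengths_seen     # 3 2 1
```

In each case the loop variable simply takes the next element of whatever vector or list it is given, so the code inside the loop has to be appropriate for that kind of element.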
Note that without {...} enclosing your code, R will run only the first statement right after the for() call. This can exist on the same line, or on the next line. Subsequent lines, regardless of indentation, will not be run as part of the loop. This behaviour lets you quickly build a simple for() loop or you can extend the behaviour to accomplish many or more complex tasks.
Let's take a look at the seq() function and how you can use it within a for() loop.
# Use the seq() function
seq(from = 1, to = 10, by = 0.5)
# let's use seq() in a for loop to count, no braces but indentation
for(variable in seq(1, 10, 0.5))
print(variable)
print("middle but not really")
print("This is the end")
- 1
- 1.5
- 2
- 2.5
- 3
- 3.5
- 4
- 4.5
- 5
- 5.5
- 6
- 6.5
- 7
- 7.5
- 8
- 8.5
- 9
- 9.5
- 10
[1] 1 [1] 1.5 [1] 2 [1] 2.5 [1] 3 [1] 3.5 [1] 4 [1] 4.5 [1] 5 [1] 5.5 [1] 6 [1] 6.5 [1] 7 [1] 7.5 [1] 8 [1] 8.5 [1] 9 [1] 9.5 [1] 10 [1] "middle but not really" [1] "This is the end"
# for loop on a single line
for(variable in seq(1, 10, 0.5)) print(variable); print("middle but not really"); print("This is the end")
[1] 1 [1] 1.5 [1] 2 [1] 2.5 [1] 3 [1] 3.5 [1] 4 [1] 4.5 [1] 5 [1] 5.5 [1] 6 [1] 6.5 [1] 7 [1] 7.5 [1] 8 [1] 8.5 [1] 9 [1] 9.5 [1] 10 [1] "middle but not really" [1] "This is the end"
# for loop on a single line, with braces
for(variable in seq(1, 10, 0.5)) {print(variable); print("middle but not really")}; print("This is the end")
[1] 1 [1] "middle but not really" [1] 1.5 [1] "middle but not really" [1] 2 [1] "middle but not really" [1] 2.5 [1] "middle but not really" [1] 3 [1] "middle but not really" [1] 3.5 [1] "middle but not really" [1] 4 [1] "middle but not really" [1] 4.5 [1] "middle but not really" [1] 5 [1] "middle but not really" [1] 5.5 [1] "middle but not really" [1] 6 [1] "middle but not really" [1] 6.5 [1] "middle but not really" [1] 7 [1] "middle but not really" [1] 7.5 [1] "middle but not really" [1] 8 [1] "middle but not really" [1] 8.5 [1] "middle but not really" [1] 9 [1] "middle but not really" [1] 9.5 [1] "middle but not really" [1] 10 [1] "middle but not really" [1] "This is the end"
1.1.2 Common functions are just pre-programmed for() loops
As was mentioned at the start of this section, under the hood, many of the functions that we commonly use are just for() loops. We can easily replicate them with explicit for loops but it takes up extra coding time! For example, we can replicate the rep() function.
# Use the rep() function to repeat the numbers 1-5, 8 times
rep(x = 1:5, times = 8)
- 1
- 2
- 3
- 4
- 5
- 1
- 2
- 3
- 4
- 5
- 1
- 2
- 3
- 4
- 5
- 1
- 2
- 3
- 4
- 5
- 1
- 2
- 3
- 4
- 5
- 1
- 2
- 3
- 4
- 5
- 1
- 2
- 3
- 4
- 5
- 1
- 2
- 3
- 4
- 5
Let's duplicate the function of rep() with a for() loop!
# for loop version variables need to be set
rm(result, i) # Remove the variables result and i, if they exist.
x <- 1:5
n <- 8
result <- x # What happens if we remove this line?
# Build our for loop
for (i in 1:(n-1)){
result <- c(result, x)
print(result)
}
result
i
Warning message in rm(result, i): "object 'result' not found" Warning message in rm(result, i): "object 'i' not found"
[1] 1 2 3 4 5 1 2 3 4 5 [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 [39] 4 5
- 1
- 2
- 3
- 4
- 5
- 1
- 2
- 3
- 4
- 5
- 1
- 2
- 3
- 4
- 5
- 1
- 2
- 3
- 4
- 5
- 1
- 2
- 3
- 4
- 5
- 1
- 2
- 3
- 4
- 5
- 1
- 2
- 3
- 4
- 5
- 1
- 2
- 3
- 4
- 5
1.1.3 Self-referencing variables must be declared outside your loops
Why did we declare result <- x ahead of the for loop? It can get a little complicated but for our purposes, we can say that the issue lies within the for loop itself, in the line result <- c(result, x). Remember, when the kernel encounters this command, it tries to evaluate the right side of the assignment first. If result does not yet exist when the kernel goes to look for it, the assignment cannot be completed and R stops with an error. To avoid this, we need to declare result outside the loop.
There are a few ways we could do this such as with result <- NULL just so that it exists as an initialized placeholder. Instead we assigned it initially to hold the first iteration of our sequence. Either would have worked but would require different numbers of loop iterations.
If you declared result <- NULL or result <- x within the loop, it would repeat this command with every iteration, thus resetting result to its initial state with each loop. Nothing would progress! We'll use this concept to springboard us into the idea of scope.
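To make the two initialization options concrete, here is a small sketch comparing them. The names result_null and result_x are hypothetical and chosen only to keep the two versions apart.

```r
x <- 1:5
n <- 8

# Option 1: start from an empty placeholder and loop n times
result_null <- NULL
for (i in 1:n) {
  result_null <- c(result_null, x)
}

# Option 2: start from the first copy of x and loop n - 1 times
result_x <- x
for (i in 1:(n - 1)) {
  result_x <- c(result_x, x)
}

# Both versions reproduce rep()
identical(result_null, rep(x, times = n))   # TRUE
identical(result_x, rep(x, times = n))      # TRUE
```

Starting from NULL keeps the loop bounds simple (1:n), while starting from x saves one iteration at the cost of the slightly less obvious 1:(n - 1).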
1.1.4 The scope (persistence) of variables is tied to when/where they are declared
Control flow statements as with other compartmentalized sections of code can be thought of as separate rooms in a house or sandboxes in a playground.
- The R kernel is somewhat like a person in a house (your program/script) that is navigating from room to room based on a set of instructions.
- If you've been following our lectures, the very first thing we do in each class is load our libraries. These packages are given precedence by their load order, with the most recent taking highest precedence. These can be considered like a toolbox carried around by the R kernel.
- When the kernel first enters the house, it is given a (Global) notepad to take notes (like variables)
- When the kernel wanders into a new room (ie a function) it is provided a new (local) notepad to write about any new variables it encounters. If it sees anything in this new room that pertains to the outside area, it will write it down on the Global notepad.
- When it leaves a room, it must leave behind the local notepad.
Thus a variable is either global or local in scope. If it is local, then the information about it simply disappears once the kernel leaves the section of code that created it. In many languages that section is whatever sits between the {...} of a programming block. R is less strict: only functions (and calls such as local()) receive their own local notepad, while the braces of a for() loop or an if() statement do not, so variables created inside them, including the loop counter, remain in the surrounding environment after the loop ends. If you want to ensure that variables from something like a for loop remain local, you can use the local() command or create a function().
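A short sketch of which variables survive, assuming it is run in a fresh session where none of these illustrative names (k, loop_value, double_it, inside_value) already exist:

```r
# Variables created in a for loop are still there after the loop finishes
for (k in 1:3) {
  loop_value <- k * 2
}
exists("k")            # TRUE
exists("loop_value")   # TRUE

# Variables created inside a function are gone once the function returns
double_it <- function(v) {
  inside_value <- v * 2
  inside_value
}
double_it(4)             # 8
exists("inside_value")   # FALSE
```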
Why is scope important?
Understanding this concept will save you a lot of troubles down the road as you make more and more complex programs. You'll learn to avoid declaring variables in the wrong place, or trying to access ones that no longer exist in your scope. Let's revisit our example from above.
# Clear some memory and check the value of variables that may already exist
rm(result, j)
cat("The prior value of i is :", i, "\n")
# for loop version variables need to be set
x <- 1:5
n <- 8
result <- 100
# Build a local for loop - this completely isolates any new variables from the global scope
local(
for (i in 1:n){
result <- c(result, x)
print(result)
j <- result # assign a value to j
}
)
cat("The value of result is: ", result, "\n")
cat("The value of i is :", i, "\n")
cat("The value of j is :", j)
Warning message in rm(result, j): "object 'j' not found"
The prior value of i is : 7 [1] 100 1 2 3 4 5 [1] 100 1 2 3 4 5 1 2 3 4 5 [1] 100 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 [1] 100 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 [20] 4 5 [1] 100 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 [20] 4 5 1 2 3 4 5 [1] 100 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 [20] 4 5 1 2 3 4 5 1 2 3 4 5 [1] 100 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 [20] 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 [1] 100 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 [20] 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 [39] 3 4 5 The value of result is: 100 The value of i is : 7
Error in cat("The value of j is :", j): object 'j' not found
Traceback:
1. cat("The value of j is :", j)
1.1.4.1 The local() scope isolates your code from the global environment
What happened to our variable result? You can see that it was initially declared as the value of 100. When we entered the local() scope and then had the first iteration of our for() loop the code result <- c(result,x) looked locally first for the values of result and x but these variables did not exist so it pulled the values from the global environment. Subsequently a local result variable was then declared and assigned a value. This local version of result was updated with each iteration but the global version was never altered. Similarly, within the local() scope, the values of i were assigned to a new version of i within the function and never overwrote the original values of i in the main part of the code cell.
A similar effect is seen when creating and using your own functions (to be discussed) but you can see that the kernel searches for variables (and functions) in the local namespace before checking the global namespace, followed by the namespaces of the loaded packages.
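The same lookup order can be sketched with two small functions. The names scale_factor, use_global and use_local are hypothetical, and the sketch assumes both functions are defined at the top level of the notebook.

```r
scale_factor <- 10   # global variable

use_global <- function(v) {
  v * scale_factor          # no local 'scale_factor', so the global one is found
}

use_local <- function(v) {
  scale_factor <- 2         # local copy takes precedence inside the function
  v * scale_factor
}

use_global(3)   # 30
use_local(3)    # 6
scale_factor    # still 10: the local assignment never touched the global value
```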
1.1.5 Cycle through values using a for() loop
The most useful thing to do with a for loop is to cycle through values. Let's return to embryos.df and plot the total embryos for each observation across each infection date. As a twist we'll add each infection date one at a time using a loop until we get to the final version of our visualization.
# Pull down the structure and colnames of our embryos.df
str(embryos.df, give.attr = FALSE)
colnames(embryos.df)
spc_tbl_ [9,802 x 19] (S3: spec_tbl_df/tbl_df/tbl/data.frame) $ experiment : chr [1:9802] "200707_N2_LUAm1_0M_72hpi" "200707_N2_LUAm1_10M_72hpi" "200707_JU1400_LUAm1_0M_72hpi" "200707_JU1400_LUAm1_10M_72hpi" ... $ wormNumber : num [1:9802] 1 1 1 1 1 1 1 1 1 1 ... $ infectionDate: Factor w/ 9 levels "200704","200711",..: 1 1 1 1 1 1 1 1 1 1 ... $ wormStrain : Factor w/ 18 levels "N2","JU1400",..: 1 1 2 2 3 3 4 4 5 5 ... $ sporeStrain : Factor w/ 6 levels "LUAm1","MAM1",..: 1 1 1 1 1 1 1 1 1 1 ... $ sporeDose : Factor w/ 11 levels "0","10","4","3.5",..: 1 2 1 2 1 2 1 2 1 2 ... $ sporesM_cm2 : num [1:9802] 0 0.354 0 0.354 0 ... $ doseLevel : Factor w/ 6 levels "Mock","Medium",..: 1 2 1 2 1 2 1 2 1 2 ... $ spores : logi [1:9802] FALSE FALSE FALSE FALSE FALSE FALSE ... $ meronts : logi [1:9802] FALSE TRUE FALSE TRUE FALSE TRUE ... $ embryos : num [1:9802] 18 7 10 0 12 0 5 9 11 0 ... $ expTimepoint : num [1:9802] 72 72 72 72 72 72 72 72 72 72 ... $ infectionType: Factor w/ 1 level "continuous": 1 1 1 1 1 1 1 1 1 1 ... $ totalWorms : num [1:9802] 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 ... $ plateSize : num [1:9802] 6 6 6 6 6 6 6 6 6 6 ... $ fixingDate : Factor w/ 9 levels "200707","200714",..: 1 1 1 1 1 1 1 1 1 1 ... $ stainingDate : Factor w/ 10 levels "200803","200810",..: 1 1 1 1 1 1 1 1 1 1 ... $ slideDate : Factor w/ 9 levels "200804","200811",..: 1 1 1 1 1 1 1 1 1 1 ... $ imagingDate : Factor w/ 13 levels "200806","200813",..: 1 1 1 1 1 1 1 1 1 1 ...
- 'experiment'
- 'wormNumber'
- 'infectionDate'
- 'wormStrain'
- 'sporeStrain'
- 'sporeDose'
- 'sporesM_cm2'
- 'doseLevel'
- 'spores'
- 'meronts'
- 'embryos'
- 'expTimepoint'
- 'infectionType'
- 'totalWorms'
- 'plateSize'
- 'fixingDate'
- 'stainingDate'
- 'slideDate'
- 'imagingDate'
# Grab a list of infection dates from the dataset
days <- unique(embryos.df$infectionDate)
# Loop over the positions in days, adding one more infection date with each iteration
for (i in seq_along(days)) {
  plot <-
    # 1. Data: keep only the first i infection dates
    embryos.df %>%
    filter(infectionDate %in% days[1:i]) %>%
    ggplot(.) +
    # 2. Aesthetics
    aes(x = infectionDate, y = embryos, colour = infectionDate) +
    labs(title = paste0("Embryos per infection date with ", i, " days")) + # Add a title based on the day range
    guides(colour = "none") +
    # 4. Geoms
    geom_jitter()
  suppressWarnings(print(plot)) # Drop the warnings when we print the plot
  Sys.sleep(2) # Pause the system for 2 seconds
}
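A loop over positions, like the one above, behaves unexpectedly on an empty vector if the positions are written as 1:length(x). The sketch below, using a made-up empty vector empty_days, shows why seq_along() is the safer way to generate the positions.

```r
empty_days <- character(0)

# 1:length() counts down from 1 to 0 when the vector is empty
1:length(empty_days)     # 1 0
seq_along(empty_days)    # integer(0)

# Count how many times each loop body runs
runs_colon <- 0
for (i in 1:length(empty_days)) runs_colon <- runs_colon + 1

runs_seq <- 0
for (i in seq_along(empty_days)) runs_seq <- runs_seq + 1

runs_colon   # 2: the loop ran even though there was nothing to iterate over
runs_seq     # 0
```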
1.1.6 Iterating through a vector of elements in a for() loop
Another handy feature of the for() loop in R is being able to directly give the loop a vector to iterate through until there are no elements left. This will come in handy when applying the same transformations, functions, or calculations on different subsets or elements within a vector.
We'll start with a simple example of looping through a small character vector.
# for loop over a character vector, with braces
for(variable in c("I", "You", "We all")) {
print(variable)
print("scream;")
}
print ("for ice cream")
[1] "I" [1] "scream;" [1] "You" [1] "scream;" [1] "We all" [1] "scream;" [1] "for ice cream"
Let's use a t.test() to look for embryo production differences between N2 and JU1400 animals when infected at medium dose levels by the microsporidia LUAm1. We'll use a for loop to gather this information across all days.
# Build a very specific subset of data looking at only N2 and JU1400 populations
# infected by a medium dose of LUAm1
subdata <-
embryos.df %>%
filter(wormStrain %in% c("N2", "JU1400"),
sporeStrain == "LUAm1",
doseLevel == "Medium"
)
# create an empty data frame to store the output of the for loop
result <- data.frame(infectionDate = unique(subdata$infectionDate),
difference = NA,
p_value = NA)
result
# for loop to calculate difference in means between N2 and JU1400 infected by LUAm1 on the same date
for(i in result$infectionDate) {
# Generate a t-test on subset by day
t <- t.test(embryos ~ wormStrain, subdata[subdata$infectionDate == i, ])
# write the results to our data frame
result[result$infectionDate == i, "difference"] <- diff(t$estimate)
result[result$infectionDate == i, "p_value"] <- t$p.value
}
result
| infectionDate | difference | p_value |
|---|---|---|
| <fct> | <lgl> | <lgl> |
| 200704 | NA | NA |
| 200711 | NA | NA |
| 200718 | NA | NA |
| 200818 | NA | NA |
| 200822 | NA | NA |
| 200901 | NA | NA |
| 190423 | NA | NA |
| infectionDate | difference | p_value |
|---|---|---|
| <fct> | <dbl> | <dbl> |
| 200704 | -9.940000 | 1.737888e-25 |
| 200711 | -8.400000 | 6.851135e-20 |
| 200718 | -11.560000 | 1.451323e-28 |
| 200818 | -10.980000 | 4.447701e-25 |
| 200822 | -14.120000 | 1.175066e-34 |
| 200901 | -10.460000 | 8.480135e-28 |
| 190423 | -9.028205 | 2.445116e-23 |
1.1.6.1 for loops run beneath the group_by()
If the code from above seems familiar in idea, you might recognize that we are simply breaking the data into subgroups and performing a t.test on it.
We've seen this kind of paradigm before using the group_by() function in conjunction with summarise(). Using a call to group_by() we can make groups based on infectionDate and then passing along to summarise() will produce the calculations we want on each subgroup. In this case, the code is slightly cleaner and simplified compared to the for loop.
subdata_ttest <-
# Pass the subdata
subdata %>%
# Group the data
group_by(infectionDate) %>%
# Use Summarise to do the repetitive work for you
summarise(difference = diff(t.test(embryos ~ wormStrain)$estimate),
p_values = t.test(embryos ~ wormStrain)$p.value)
subdata_ttest
| infectionDate | difference | p_values |
|---|---|---|
| <fct> | <dbl> | <dbl> |
| 200704 | -9.940000 | 1.737888e-25 |
| 200711 | -8.400000 | 6.851135e-20 |
| 200718 | -11.560000 | 1.451323e-28 |
| 200818 | -10.980000 | 4.447701e-25 |
| 200822 | -14.120000 | 1.175066e-34 |
| 200901 | -10.460000 | 8.480135e-28 |
| 190423 | -9.028205 | 2.445116e-23 |
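The "split the data into groups, then repeat a calculation" idea does not depend on the tidyverse. As a small sketch on made-up data (the toy data frame and its columns are hypothetical), the explicit loop and the base R function tapply() give the same per-group means:

```r
toy <- data.frame(
  group = c("a", "a", "b", "b", "b"),
  value = c(1, 3, 10, 20, 30)
)

# Explicit loop over the groups, storing one named result per group
loop_means <- numeric(0)
for (g in unique(toy$group)) {
  loop_means[g] <- mean(toy$value[toy$group == g])
}

# The same calculation with tapply(), which does the looping for you
tapply_means <- tapply(toy$value, toy$group, mean)

loop_means     # a = 2, b = 20
tapply_means   # a = 2, b = 20
```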
1.2.0 Generate conditional branches using if() statements
[Figure: Conditional branching only runs code when criteria have been met!]
One of the big advantages of programming is to have conditional statements in your code. R can make binary decisions like "if data meets a condition, do this". Some of these happen implicitly as in a for() loop (i.e. keep repeating the code until you run out of input) but you can also declare these decision branches explicitly.
The if() (conditional argument) evaluates statements that produce a TRUE or FALSE result. The general format is
if (boolean expression) {
# statement(s) will execute if the boolean expression is true.
}
Let's give it a try on a simple example.
# Practice with an if() statement
x <- c("what", "is", "truth")
if("truth" %in% x) {
print("Truth is found")
}
[1] "Truth is found"
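One detail worth keeping in mind: if() expects a single TRUE or FALSE. In R 4.2.0 and later, a condition with more than one element is an error, so comparisons that return one value per element need to be collapsed first. A minimal sketch using any() and all(), with a made-up variable found:

```r
x <- c("what", "is", "truth")

# A comparison on a vector returns one logical value per element
x == "truth"          # FALSE FALSE TRUE

# Collapse to a single value before handing it to if()
if (any(x == "truth")) {
  found <- "Truth is found"
}
found

# all() asks whether every element meets the condition
all(nchar(x) > 1)     # TRUE
```

The %in% operator used above already returns a single value here because its left-hand side is a single string.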
1.2.1 More complex conditional branches may require the else statement
Now that we know how to use if() statements, what if we want to give a second instruction based on the outcome of the if() statement? The else and else if() statements exist to extend the conditional branch through additional considerations. In general, the structure looks like this:
if(boolean_expression #1) {
# statement(s) will execute if the boolean expression #1 is TRUE.
} else if (boolean_expression #2) {
# statement(s) will execute if the new boolean expression #2 is TRUE.
} else {
# statement(s) will execute if none of the above boolean expressions were TRUE.
}
You can include any number of else if() statements in the middle of the flow control but you should end with only a single else statement or none at all. Remember, the else statement is a catch-all, last resort to deal with any unexpected scenarios.
# Practice with a complex if() statement
x <- c("what", "is", "truth")
# Build a complex cascade of statements looking for Truth
if("TRUTH" %in% x) {
print("TRUTH is found")
} else if ("Truth" %in% x) {
print ("Truth is found")
} else { # notice the placement of else is directly after the closing }
print ("The truth is out there somewhere")
}
[1] "The truth is out there somewhere"
Remember that the if/else statements will cascade through! Therefore, with proper ordering of your expressions, you can simplify them as we see below with a grade assignment conditional branching statement.
# Pick a student grade
grade <- 69
letterGrade <- "Unassigned"
# Long if statement for choosing grades
if (grade >= 90) { letterGrade <- "A+"
} else if (grade >= 85) { letterGrade <- "A"
} else if (grade >= 80) { letterGrade <- "A-"
} else if (grade >= 77) { letterGrade <- "B+"
} else if (grade >= 73) { letterGrade <- "B"
} else if (grade >= 70) { letterGrade <- "B-"
} else {letterGrade <- "FZ"}
# What is the assigned letter grade?
letterGrade
1.2.2 if() statements can be nested¶
Sometimes you may have a series of branching criteria that you want met or you want to perform a series of additional checks after a first level of criteria are met. In that case you may wish to use a series of nested if() statements. Let's take a look at the code cell below for an example of nesting if() statements.
# Only proceed if you have data in numVector
numVector <- c(1,2,3,4)
if(length(numVector) >0) {
# You have data to look at so print it
print(numVector)
if(sum(numVector) > 8) {
# Your vector has a sum greater than 8
print(sum(numVector))
if(prod(numVector) < 20) {
# Your vector has a product of less than 20
print(prod(numVector))
} else {
print("product is >= 20")
}
} else {
print("sum is <= 8 ")
}
} else {
print("vector is empty")
}
[1] 1 2 3 4
[1] 10
[1] "product is >= 20"
1.2.3 Use if() statements to generate system messages¶
If/else statements can also be used to perform system-wide tasks, like generating a warning or halting your code. For example, if we are writing a file to a directory and there is already a file with the same name, we should generate a warning or simply stop. Without the warning, the existing file will be silently overwritten.
# Check if our file exists
# Use dir() to return a vector of file names and then ask if any match ours.
if(sum(dir() == "embryo_subdata_ttest.csv") > 0) {
print("Stop! A file with that same name already exists")
} else {
# The file does not exist, print the go-ahead and save the file
print("No files with the same name. Good to go!")
}
write_csv(x = subdata_ttest, file = "embryo_subdata_ttest.csv", col_names = TRUE)
[1] "Stop! A file with that same name already exists"
Challenge: Is there a cleaner way to produce our conditional?
1.2.3.1 Use effective control flow to ensure your intentions are met¶
Despite the message printed by our code, the file in our example would still be overwritten. The call to write_csv() is outside the control flow of the if()/else conditional. To fulfill our true intentions, we should move the placement of the write_csv() function so that it is under the direct influence of the control flow.
# Check if our file exists
# Use dir() to return a vector of file names and then ask if any match ours.
if("embryo_subdata_ttest.csv" %in% dir()) {
print("Stop! A file with that same name already exists")
} else {
# The file does not exist, print the go-ahead and save the file
print("No files with the same name. Good to go!")
# Write the file as part of the same control statement
write_csv(x = subdata_ttest, file = "embryo_subdata_ttest.csv", col_names = TRUE)
}
[1] "Stop! A file with that same name already exists"
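Printing a message does not actually halt anything. As a minimal sketch of the "generate a warning or simply stop" idea, the block below uses the base R functions file.exists(), stop() and warning(). The helper name safe_write(), its overwrite parameter, and the temporary file are hypothetical and only for illustration; base write.csv() is swapped in for write_csv() so that the example needs no packages.

```r
# Hypothetical helper: refuse to overwrite an existing file unless explicitly allowed
safe_write <- function(data, path, overwrite = FALSE) {
  if (file.exists(path) && !overwrite) {
    # stop() raises an error, so nothing after this line runs
    stop("A file named '", path, "' already exists")
  }
  if (file.exists(path)) {
    # warning() reports the problem but lets execution continue
    warning("Overwriting existing file: ", path)
  }
  write.csv(data, file = path, row.names = FALSE)
  invisible(path)
}

# Demonstrate on a temporary file so that no real data is touched
tmp <- tempfile(fileext = ".csv")
safe_write(data.frame(x = 1:3), tmp)   # first write succeeds

# The second write is refused; tryCatch() captures the error message instead of halting
result <- tryCatch(safe_write(data.frame(x = 4:6), tmp),
                   error = function(e) conditionMessage(e))
result
```

Because stop() exits the function immediately, the write can never be reached by accident, which is the same guarantee we obtained above by moving write_csv() inside the else branch.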
1.2.4 The if() and else statement is an effective control flow statement for simple tasks¶
As we've seen a couple of times in lecture now, rather than making a large control flow block for simple tasks, we can use the if() or ifelse() commands to contain all of our conditional statements and commands in a single expression.
The if() else syntax can take the simple form of:
if (conditional_expression) TRUE_result else FALSE_result
The conditional_expression used in our statement must evaluate to a single TRUE or FALSE. If it evaluates to NA, an error is produced. If it is a logical vector of length greater than one, R 4.2.0 and later also produce an error, while earlier versions issued a warning and used only the first element.
The results from the above syntax may also be assigned to a variable to use later. Let's look at the following code cells for more examples.
# Use if when x is TRUE
x <- TRUE
if(x) "True result"
# Use if when x is FALSE
x <- FALSE
if(x) "False result"
# Use if when x is NA
x <- NA
if(x) "NA result"
Error in if (x) "NA result": missing value where TRUE/FALSE needed
# You can make complex logical expressions as long as they evaluate to either TRUE or FALSE!
x <- TRUE
y <- FALSE
z <- if(x | y) "At least one variable was TRUE!!!"
z
# Use if else when x is TRUE
x <- TRUE
if(x) "True result" else "False result"
# Use if else when x is FALSE
x <- FALSE
if(x) "True result" else "False result"
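Since if() needs a single TRUE or FALSE, a logical vector has to be reduced to one value first. The sketch below shows one way to do that with the base R functions any(), all() and isTRUE(); the variable names are made up for illustration.

```r
values <- c(3, 12, 7)

# values > 10 is a logical vector of length 3, so it cannot be used directly in if()
values > 10

# any() and all() each reduce the vector to a single TRUE or FALSE
anyLarge <- if (any(values > 10)) "at least one value above 10" else "no values above 10"
allLarge <- if (all(values > 10)) "every value above 10" else "not every value above 10"

# isTRUE() is FALSE for anything that is not a single TRUE, including NA
x <- NA
naSafe <- if (isTRUE(x)) "x is TRUE" else "x is not TRUE"

anyLarge
allLarge
naSafe
```

Wrapping a condition in isTRUE() is a simple way to avoid the "missing value where TRUE/FALSE needed" error when NA values are possible.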
1.2.5 The ifelse() statement allows vectorized conditional assignment¶
Like the above if() statement, this allows us to assign branched output without building the full branching structure. However, as we alluded to in lecture 6, this is a much more powerful command than it appears to be as you can supply a set of vectors to this function to produce a vector of results!
ifelse(test = boolean_expression_vector,
yes = true_outcome_vector/true_outcome_action,
no = false_outcome_vector/false_outcome_action)
Watch out for vector recycling! It's convenient for re-assigning values across vectors but note that we aren't performing any complex actions or response - just assigning outcomes/values based on our evaluation expression.
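To make the recycling warning concrete, here is a small sketch with made-up vectors. The yes and no arguments are stretched to the length of test, which is convenient when they have length one but can hide mistakes when their lengths are mismatched.

```r
test <- c(TRUE, FALSE, TRUE, FALSE)

# yes has length 2 and no has length 1; both are recycled to the length of test
recycled <- ifelse(test = test, yes = c("a", "b"), no = "z")
recycled   # "b" is never used because it lines up with FALSE positions

# The common, safe pattern: yes and no are length 1 or the same length as test
scores <- c(4, 9, 2, 7)
capped <- ifelse(test = scores > 5, yes = 5, no = scores)
capped     # each score is kept unless it exceeds 5
```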
# A simple example of ifelse()
rm(a)
i <- 8
ifelse(test = i < 5, yes = a <- 0, no = a <- 1)
# a
# A complex vectorized example of ifelse()
i <- c(1:10)
ifelse(test = i < 5, yes = 0, no = 1) # Can we achieve this in a simpler way?
- 0
- 0
- 0
- 0
- 1
- 1
- 1
- 1
- 1
- 1
# Don't forget that we can quickly convert booleans to numeric!
as.numeric(i >= 5)
- 0
- 0
- 0
- 0
- 1
- 1
- 1
- 1
- 1
- 1
If you are looking for more ways to do this kind of general vectorised "if" assignment, you can look into the dplyr::case_when() function which will allow multiple conditionals and specific assignment outcomes.
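As a rough sketch of what that looks like, the grade cascade from section 1.2.1 can be applied to a whole vector with dplyr::case_when(). The grades vector is invented for illustration, and the "Missing" label for NA values is an assumption rather than part of the original grading scheme. Conditions are checked in order and the first match wins, just like a cascading if/else.

```r
library(dplyr)

grades <- c(95, 86, 78, 69, NA)

# Each line is condition ~ value; the final TRUE acts as the catch-all "else"
letterGrades <- case_when(
  is.na(grades) ~ "Missing",
  grades >= 90  ~ "A+",
  grades >= 85  ~ "A",
  grades >= 80  ~ "A-",
  grades >= 77  ~ "B+",
  grades >= 73  ~ "B",
  grades >= 70  ~ "B-",
  TRUE          ~ "FZ"
)
letterGrades
```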
1.2.6 Replace long/simple/cascading if statements with switch()¶
There are a lot of simple cases where one can imagine a series of possible character input values and corresponding output values.
For instance, when examining a specific series of categories or character values, we can definitely create a complex and rather long if/else if/else series of statements. We can, however, replace that long series of code with a more compact version where we simply identify the case/assignment pairings.
In programming we call these switch or case statements. Let's look at an example below.
# Pick your favourite pokemon!
pokemon <- "squirtle"
dexType <- "unknown"
# Look at how long this listing gets
if (pokemon == "bulbasaur") {dexType <- "plant"
} else if (pokemon == "squirtle") {dexType <- "water"
} else if (pokemon == "charmander") {dexType <- "fire"
} else if (pokemon == "pikachu") {dexType <- "electric"
} else if (pokemon == "lapras") {dexType <- "water/ice"
} else if (pokemon == "snorlax") {dexType <- "normal"
} else if (pokemon == "magikarp") {dexType <- "water"
} else {dexType <- "unknown input"}
dexType
As we can see above, things can get long and complicated for assigning values with an if statement.
# Pick your favourite pokemon!
pokemon <- "lapras"
dexType <-
switch(pokemon,
"bulbasaur" = "plant",
"squirtle" = "water",
"charmander" = "fire",
"pikachu" = "electric",
"lapras" = "water/ice",
"snorlax" = "normal",
"magikarp" = "water",
"unknown input")
dexType
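Two further behaviours of switch() are worth a quick sketch: several inputs can share one result by leaving a value empty, and an unmatched character input returns NULL when no default is supplied. The wrapper function dex_type() is a hypothetical name used only to make the example reusable.

```r
# Hypothetical helper wrapping switch()
dex_type <- function(pokemon) {
  switch(pokemon,
         "squirtle" = ,            # empty value: falls through to the next non-empty one
         "magikarp" = "water",
         "charmander" = "fire",
         "unknown input")          # unnamed final argument acts as the default
}

dex_type("squirtle")   # shares the result of "magikarp"
dex_type("mew")        # no match, so the default is returned

# Without a default, an unmatched character input returns NULL invisibly
noDefault <- switch("mew", "squirtle" = "water")
is.null(noDefault)
```

Including a default is usually the safer design, since a silent NULL can cause confusing errors further along in your code.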
1.3.0 Running loops without a predetermined end-point¶
There may be instances where you need to run loops on data until you find a certain piece of information, or until a specific condition is met rather than examining all of the elements within a set. There are two ways you can accomplish these "open-ended" loops.
1.3.1 while() loops run conditionally¶
Unlike using for() loops which continue to execute until a specific iteration number, the while() loop executes a command as long as a conditional expression continues to evaluate as TRUE at each iteration. This conditional expression must evaluate as TRUE to begin execution as well. The while() loop can be thought of as a special implementation of an if() statement that repeats over and over again until the conditional fails.
Let's work with some simple examples.
# Initialize our variable for conditional assessment
x <- 0
# Generate the while loop, incrementing x by 1 on each iteration, as long as x < 10
while(x < 10) {
x <- x + 1
print(x)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
# Loop will be ignored if the condition is FALSE and nothing gets printed
x <- 20
while(x < 10) {
x <- x + 1
print(x)
}
1.3.1.1 Conditional loops can become endless¶
When programming a conditional loop you must always include a statement that alters the condition or breaks out of the loop itself. It's also important to note the order or placement of when you alter the condition in your loops. All the command statements within the loop, unless otherwise specified, will execute before the re-evaluation of the conditional statement.
For example, a programmer is assigned a task: "While you're at the grocery store, buy some eggs". The programmer never came back home.
# Set your initial value
programmer <- " at the grocery store"
# Build your while loop
while(programmer == " at the grocery store") {
print("buy some eggs")
programmer <- "bought some eggs" # What would happen if we commented out this line?
}
print(programmer)
# When do we provide the opportunity to change?
[1] "buy some eggs"
[1] "bought some eggs"
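One defensive habit, sketched below under the assumption that some upper bound on iterations is acceptable for your task, is to count iterations and exit once a limit is reached. The variable names and the limit of 100 are arbitrary choices for illustration.

```r
x <- 1
iterations <- 0
maxIterations <- 100   # assumed safety limit; choose one that suits your task

# Halve x until it drops below 0.01, but never loop more than maxIterations times
while (x >= 0.01) {
  x <- x / 2
  iterations <- iterations + 1
  if (iterations >= maxIterations) {
    warning("Stopped after reaching the iteration limit")
    break
  }
}

x            # first value below 0.01
iterations   # number of halvings that were needed
```

If the condition could never become FALSE, for example because of a typo in the update step, the counter guarantees that the loop still ends.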
1.3.2 Using next and break to exit any kind of looping structure¶
The next and break commands both interrupt the current iteration of a loop, but they differ in what happens afterwards.
- The next command exits the current iteration of the loop structure and moves on to the next iteration of the loop. Use this to skip over or avoid specific commands within your loop.
- The break command completely exits the loop structure, as if it had reached its natural end. Use this to permanently exit your looping structure.
Let's use the following examples to see how these mechanisms work.
# using next within our for loop
for(i in 1:10) {
if (i >= 5 & i <= 8) {
next # ends the current iteration and skips to the next one
}
print(i)
}
i
[1] 1
[1] 2
[1] 3
[1] 4
[1] 9
[1] 10
# Using break
for(i in 1:10) {
if (i == 5) {
break # completely exits the loop
}
print(i)
}
i
[1] 1
[1] 2
[1] 3
[1] 4
1.3.3 repeat loops run endlessly unless specifically interrupted by break¶
Unlike the while() loop, which ends once its conditional expression evaluates to FALSE, a repeat loop has no explicit conditional statement built into its construction. Instead, it will continue to repeat until it is exited with the break command.
# Using repeat to loop until break is called
i = 1
repeat {
if (i == 20) {
break # completely exits the loop
}
print(i)
i = i + 1
}
i
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
[1] 11
[1] 12
[1] 13
[1] 14
[1] 15
[1] 16
[1] 17
[1] 18
[1] 19
1.3.4 Be mindful of how you iterate through your loops¶
Depending on the order in which you set up your conditionals, you may accidentally produce unexpected issues. It is best to consider the order in which you want to accomplish tasks within your loops before beginning the next iteration. This is especially relevant in the case of a conditional loop (while() or repeat) where you must include a variable that can eventually meet the desired conditions for exit.
Take the time to visually and mentally test your code using a series of base cases by asking yourself what input and output should look like: before the first iteration, after the first iteration, in the middle of your dataset, in your penultimate iteration, in your final iteration. Quickly assessing these on a small test set can also help you identify potential problems!
# Using repeat() to demonstrate that conditional placement matters.
i = 1
# What numbers will this code print?
# What happens if we move the print command around?
repeat {
i = i + 1
if (i == 20) {
break # completely exits the loop
}
print(i)
}
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
[1] 11
[1] 12
[1] 13
[1] 14
[1] 15
[1] 16
[1] 17
[1] 18
[1] 19
1.3.4.1 Use loops to simplify your code, but don't re-invent the wheel!¶
Depending on the task you are working on, perhaps there is already a function that satisfies your need so you don't have to use explicit for() loops. Make use of existing functions whenever you can because those have already been optimized to be fast and efficient.
Taking advantage of functions can allow you to keep your code clean rather than programming for loops to generate a simple number pattern.
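As a small illustration of that point, the sketch below builds the same number pattern with an explicit for() loop and with vectorized arithmetic, then uses the base R functions seq() and cumsum() for two other common patterns. The variable names are arbitrary.

```r
# Loop version: fill a pre-allocated vector with the squares of 1 to 10
squaresLoop <- numeric(10)
for (i in 1:10) {
  squaresLoop[i] <- i^2
}

# Vectorized version: arithmetic operators already work element by element
squaresVec <- (1:10)^2

# seq() generates regular patterns without any loop
evens <- seq(from = 2, to = 20, by = 2)

# cumsum() produces running totals in a single call
runningTotal <- cumsum(1:5)

squaresLoop
squaresVec
evens
runningTotal
```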
Comprehension Question 1.0.0 Answer:¶
2.0.0 Increasing our complexity by combining for() loops with ggplot()¶
Let's say, we are ready to start making some plots for our manuscript, and we want to make individual plots for each infectionDate (replicate). The code below makes a boxplot for each worm strain from the 190423 replicate of our data.
library(repr)
# Note that the standard display size is a 7x7 inch space. Let's triple the width
options(repr.plot.width=21, repr.plot.height=7)
ggplot(embryos.df[embryos.df$infectionDate == "190423",]) +
#2 Aesthetics
aes(x = wormStrain, y = embryos, fill = wormStrain) +
# 4. Geoms
geom_boxplot()
But what if I were to have, say, multiple infection dates? In this case, a for loop will be the way to go. Take a look at the following code:
# Loop through the possible infection dates
for (i in unique(embryos.df$infectionDate)) {
infectionRep <-
ggplot(embryos.df[embryos.df$infectionDate == i,]) +
#2 Aesthetics
aes(x = wormStrain, y = embryos, fill = wormStrain) +
theme_grey() +
ggtitle("Embryo counts") + # plot title
# 4. Geoms
geom_boxplot()
print(infectionRep) # This is the only way to view the plot in a for loop
# Save each plot as it's generated
ggsave(plot = infectionRep, filename = paste(i, "boxplot.png", sep = "_"), path = "data/" ,
scale=2, device = "png", units = c("cm"))
}
Saving 33.9 x 33.9 cm image
Saving 33.9 x 33.9 cm image
Saving 33.9 x 33.9 cm image
Saving 33.9 x 33.9 cm image
Saving 33.9 x 33.9 cm image
Saving 33.9 x 33.9 cm image
Saving 33.9 x 33.9 cm image
Saving 33.9 x 33.9 cm image
Saving 33.9 x 33.9 cm image
2.1.0 Take advantage of for loop variables to customize output in each loop¶
From above you can see that we can take advantage of our incrementing variables within the for loop. We can use them to help subset data and to generate titles and file names. You can use them in combination with other control statements to update the image as well! Just remember to avoid generating errors within your for() loop when accessing or altering data. Ensure you aren't trying to reference or alter data or subsets that do not exist due to missing information in your original datasets.
What if I want to facet our data for each infection date across sporeStrain and doseLevel?
# Loop through the possible infection dates
for (i in unique(embryos.df$infectionDate)) {
infectionRep <-
ggplot(embryos.df[embryos.df$infectionDate == i,]) +
# 2. Aesthetics
aes(x = wormStrain, y = embryos, fill = wormStrain) +
theme_grey() +
ggtitle(paste0("Embryo counts on infection date: ", i)) + # plot title
# 4. Geoms
geom_boxplot() +
# 6. Facets
facet_grid(sporeStrain ~ doseLevel) ### 2.1.0 Facet our data by strain and dose
# Only print the plot for the 190423 infection date
if (i == "190423") {
print(infectionRep) # The only way to see the plot is to print it within a for loop
}
# Save each plot as it's generated
ggsave(plot = infectionRep, filename = paste(i, "boxplot.facet.png", sep = "_"), path = "data/" ,
scale=2, device = "png", width = 20, height = 10, units = c("cm"))
}
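The warning above about referencing subsets that do not exist can be handled with an if() check and next inside the loop. The following is only a sketch on a toy data frame: toy.df, its values, and the datesToCheck vector are invented stand-ins for embryos.df and its infection dates.

```r
# Toy data standing in for embryos.df (hypothetical values)
toy.df <- data.frame(infectionDate = c("A", "A", "B"),
                     embryos = c(10, NA, NA))

datesToCheck <- c("A", "B", "C")   # "C" does not appear in the data at all
meanEmbryos <- c()

for (i in datesToCheck) {
  # Keep only rows for this date that have an embryo count
  subset.df <- toy.df[toy.df$infectionDate == i & !is.na(toy.df$embryos), ]

  # Skip this iteration when there is nothing to work with
  if (nrow(subset.df) == 0) {
    message("Skipping ", i, ": no usable rows")
    next
  }
  meanEmbryos[i] <- mean(subset.df$embryos)
}

meanEmbryos
```

The same check could sit in front of the ggplot() and ggsave() calls in the loops above so that empty plots are never written to disk.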
3.0.0 Can we jump around our code to perform different tasks?¶
Yes! So far we've covered many options for control flow but all of our programs have been moving in a linear direction from start to end. That is also just a consequence of working with a Jupyter notebook. Programs, however, are not necessarily run in a linear fashion.
What if you need to perform a set of similar instructions multiple times, at multiple points within your control flow? Perhaps it's even the same kind of for() loop on different sets of data? There are a lot of tricks like nested loops but you're better off knowing how to make functions that can be used in other code as well!
The general structure of a script or program can be divided into
Global/environmental variables and declarations
- Describe your script and assumptions
- Import your libraries
- Declare any global variables
Main program
- The place where the main statements occur.
- It may also be a function call with specific arguments like the location of data files.
- Reading through your annotations, someone else should be able to discern what your program is doing.
Helper functions or subroutines
- Here you can create functions or "mini" programs that do work for you.
- They can be called from anywhere within the program (once loaded into memory).
- They are well suited to repetitive tasks whose output varies only with the input provided.
- Subroutines may work together or call on each other to accomplish a greater task.
- Functions that you use often can be placed into their own files for importing just like a package.
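A toy script laid out in that order might look like the sketch below. All of the names (scaleFactor, main(), double_values(), summarise_values()) are hypothetical. Wrapping the main program in its own function is one way to keep it near the top of the file, since R only looks up the helper functions when main() is actually called at the very end.

```r
# ---- Global/environmental variables and declarations ----
# Purpose: double a small set of numbers and summarise them (toy example)
scaleFactor <- 2

# ---- Main program ----
main <- function() {
  raw <- c(1, 2, 3, 4)
  summarise_values(double_values(raw))
}

# ---- Helper functions ----
# Multiply each value by a scaling factor
double_values <- function(values, by = scaleFactor) {
  values * by
}

# Return the mean and maximum as a named vector
summarise_values <- function(values) {
  c(mean = mean(values), max = max(values))
}

# ---- Run the program once everything is loaded into memory ----
result <- main()
result
```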
3.0.1 Do One Thing - but do it well¶
A best practice when writing functions is the "Do One Thing" principle: each function should do one thing; one task. Instead of a big function, you can write several small ones per task, without going to the other extreme which would be fragmenting your code into a ridiculous amount of code snippets. By doing the one thing, your functions become:
- More flexible
- More easily understood
- Simpler to test
- Simpler to debug
- Easier to change
Time to start writing our own functions!
3.0.2 Document your functions¶
While we have been using help() and ? to look up documentation on the various functions we've been using, our user-defined functions will not have any kind of accessible documentation. Of course if we were making specific packages for R we could create accessible documentation.
Regardless of this problem, it is best practice to document your functions much like you document the rest of your code. In this case you can include information such as:
- Description: what the function does
- Parameters: these are inputs for the function and any object-typing or formatting that is expected. You can also include a description of the default values for each.
- Returns: what is the structure of the return object?
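Here is a minimal sketch of that documentation style applied to a tiny function. The function celsius_to_fahrenheit() and its digits parameter are made up for illustration; the comment block above it follows the Description, Parameters and Returns headings just listed.

```r
# Description: converts temperatures from degrees Celsius to degrees Fahrenheit
# Parameters:
#   celsius: numeric vector of temperatures in degrees Celsius
#   digits:  number of decimal places to round the result to (default: 1)
# Returns: numeric vector of the same length as celsius, in degrees Fahrenheit
celsius_to_fahrenheit <- function(celsius, digits = 1) {
  round(celsius * 9 / 5 + 32, digits = digits)
}

temps <- celsius_to_fahrenheit(c(0, 37, 100))
temps
```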
3.1.0 Declare your own functions with function()¶
In R, a function is declared with the following syntax:
function_name = function(parameter1_name, parameter2_name, ... parameterN_name = preset_value) {
# The specific code of your function goes within the {...}
return(output)
}
Let's convert our plotting code from above into a simple function!
# Description: This function, given a set of data in the embryos.df format, will produce
# a faceted series of box plots for a specific infection date, split by sporeStrain and doseLevel
# Input:
# data.df: a data frame with at least the following column names
# $infectionDate, $wormStrain, $sporeStrain, $doseLevel, $embryos
# infDate: a character string used to subset the data
# Output: make.facet.plot will print a faceted plot from data.df based on the infDate variable
# The plot will be saved to a file ending in "boxplot.facet.makeFunction.png"
make.facet.plot = function(data.df, infDate) {
infectionRep <-
# You could also filter your data with filter() and piping instead!
ggplot(data.df[data.df$infectionDate == infDate,]) +
# 2. Aesthetics
aes(x = wormStrain, y = embryos, fill = wormStrain) +
theme_grey() +
ggtitle(paste0("Embryo counts on infection date: ", infDate)) + # plot title
# 4. Geoms
geom_boxplot() +
# 6. Facets
facet_grid(sporeStrain ~ doseLevel)
# Print the plot
print(infectionRep)
# Save each plot as it's generated
ggsave(plot = infectionRep, filename = paste(infDate, "boxplot.facet.makeFunction.png", sep = "_"), path = "data/" ,
scale=2, device = "png", width = 20, height = 10, units = c("cm"))
} # End of make.facet.plot
3.1.1 Once your declared functions are stored in memory, they can be called from anywhere.¶
Now that our subroutine is stored in memory, it can be called as we want! Maybe even use it for different data sets as long as it meets the requirements set out in our description of the function itself. You can even build upon it to use control flow to decide if it will be faceted or not. The code between the two versions is so similar, you could break it into an if statement.
Figure: Call your functions from anywhere once they are stored in memory.
Let's try to use it right now.
unique(embryos.df$infectionDate)
- 200704
- 200711
- 200718
- 200818
- 200822
- 200901
- 200912
- 200915
- 190423
Levels:
- '200704'
- '200711'
- '200718'
- '200818'
- '200822'
- '200901'
- '200912'
- '200915'
- '190423'
# Use a for loop to iterate through the first 3 levels of infectionDate
for (i in unique(embryos.df$infectionDate)[1:3]){
# Call on our function now
make.facet.plot(data.df = embryos.df, i)
}
3.2.0 Retrieve data from your function using the return() command¶
Some of your functions may generate subsets of data or results that you would like to further investigate for analysis. For example, when we generate our plots, perhaps we would like to also retrieve information like where the file was saved, along with the subset of data for each.
Using the return() command has two consequences:
- It will terminate or exit the function currently running once this command is called.
- It will return a single object that will be assigned to a variable or be displayed to the console if unassigned
A special note about the returned object. This can be any kind of object and if you want to return multiple objects, put them in a list! Let's update our function.
# Description: This function, given a set of data in the embryos.df format, will produce
# a faceted series of box plots for a specific infection date, split by sporeStrain and doseLevel
# Input:
# data.df: a data frame with at least the following column names
# $infectionDate, $wormStrain, $sporeStrain, $doseLevel, $embryos
# infDate: a character string used to subset the data
# Output: save.facet.plot will generate a faceted plot from data.df based on the infDate variable
# The plot will be saved to a file ending in "boxplot.facet.saveFunction.png"
# It will return
# [1] subset data
# [2] ggplot object
# [3] save plot filename
save.facet.plot = function(data.df, infDate) {
### 3.2.0 We've updated the plot to use a filter() function!
infection.data <- data.df %>% filter(infectionDate == infDate)
infectionPlot <-
ggplot(infection.data) +
# 2. Aesthetics
aes(x = wormStrain, y = embryos, fill = wormStrain) +
theme_grey() +
ggtitle(paste0("Embryo counts on infection date: ", infDate)) + # plot title
# 4. Geoms
geom_boxplot() +
# 6. Facets
facet_grid(sporeStrain ~ doseLevel)
# Save the name of the plot file
save.file <- paste(infDate, "boxplot.facet.saveFunction.png", sep = "_")
# Save each plot as it's generated, reusing the file name built above
ggsave(plot = infectionPlot, filename = save.file, path = "data/" ,
scale=2, device = "png", width = 20, height = 10, units = c("cm"))
### 3.2.0 return the file name and data subset
# Create a list so that you can send multiple objects back as a single object
return(list(infection.data, infectionPlot, save.file))
}
# Call on save.facet.plot function now
inf200704.plot <- save.facet.plot(embryos.df, "200704")
# Look at the data
head(inf200704.plot[[1]])
# Display the plot to output
inf200704.plot[[2]]
# What's the file name?
inf200704.plot[[3]]
| experiment | wormNumber | infectionDate | wormStrain | sporeStrain | sporeDose | sporesM_cm2 | doseLevel | spores | meronts | embryos | expTimepoint | infectionType | totalWorms | plateSize | fixingDate | stainingDate | slideDate | imagingDate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <chr> | <dbl> | <fct> | <fct> | <fct> | <fct> | <dbl> | <fct> | <lgl> | <lgl> | <dbl> | <dbl> | <fct> | <dbl> | <dbl> | <fct> | <fct> | <fct> | <fct> |
| 200707_N2_LUAm1_0M_72hpi | 1 | 200704 | N2 | LUAm1 | 0 | 0.000000 | Mock | FALSE | FALSE | 18 | 72 | continuous | 1000 | 6 | 200707 | 200803 | 200804 | 200806 |
| 200707_N2_LUAm1_10M_72hpi | 1 | 200704 | N2 | LUAm1 | 10 | 0.353857 | Medium | FALSE | TRUE | 7 | 72 | continuous | 1000 | 6 | 200707 | 200803 | 200804 | 200806 |
| 200707_JU1400_LUAm1_0M_72hpi | 1 | 200704 | JU1400 | LUAm1 | 0 | 0.000000 | Mock | FALSE | FALSE | 10 | 72 | continuous | 1000 | 6 | 200707 | 200803 | 200804 | 200806 |
| 200707_JU1400_LUAm1_10M_72hpi | 1 | 200704 | JU1400 | LUAm1 | 10 | 0.353857 | Medium | FALSE | TRUE | 0 | 72 | continuous | 1000 | 6 | 200707 | 200803 | 200804 | 200806 |
| 200707_ED3052A_LUAm1_0M_72hpi | 1 | 200704 | ED3052A | LUAm1 | 0 | 0.000000 | Mock | FALSE | FALSE | 12 | 72 | continuous | 1000 | 6 | 200707 | 200803 | 200804 | 200806 |
| 200707_ED3052A_LUAm1_10M_72hpi | 1 | 200704 | ED3052A | LUAm1 | 10 | 0.353857 | Medium | FALSE | TRUE | 0 | 72 | continuous | 1000 | 6 | 200707 | 200803 | 200804 | 200806 |
If you need to pass information to the function like a variable, dataframe, list etc - do this through the arguments! Whenever you need to return information, then return it as part of a list if needed. The function should be agnostic of the world around it. Some assumptions can be made like loading preset libraries outside the function, but you can even do that within your functions!
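The sketch below illustrates both consequences of return() on a deliberately small, hypothetical function, summarise_vector(). It exits early for empty input, and it names the elements of the returned list so the caller can use $ rather than remembering positions such as [[1]] and [[2]].

```r
summarise_vector <- function(values) {
  # return() exits immediately, so nothing below runs for empty input
  if (length(values) == 0) {
    return(NULL)
  }

  # Bundle several results into one named list
  return(list(n = length(values),
              mean = mean(values),
              range = range(values)))
}

stats <- summarise_vector(c(2, 4, 9))
stats$n
stats$mean
stats$range

# The early return gives NULL for an empty vector
emptyResult <- summarise_vector(numeric(0))
is.null(emptyResult)
```

Everything the function needs arrives through its argument, and everything it produces leaves through the returned list, which keeps it independent of whatever else is in your environment.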
3.3.0 Arguments for your functions can have a default value¶
The last helpful part of making functions is to consider providing default values for some of your arguments. In some cases you may have a subset of datasets that need to be treated differently so including an argument for your function to toggle certain behaviours is helpful. Including these arguments, however, means you have to define them every time you call on the function unless you assign a default value.
Default values are only overridden by supplied arguments, otherwise these will be applied within your function.
Before we update our save.facet.plot() let's see what happens if we simply forget to include a parameter.
# Rerun our function without an infection date
save.facet.plot(embryos.df)
Error in `filter()`:
i In argument: `infectionDate == infDate`.
Caused by error:
! argument "infDate" is missing, with no default
Traceback:
1. save.facet.plot(embryos.df)
2. data.df %>% filter(infectionDate == infDate) # at line 19 of file <text>
3. filter(., infectionDate == infDate)
4. filter.data.frame(., infectionDate == infDate)
5. filter_rows(.data, dots, by)
6. filter_eval(dots, mask = mask, error_call = error_call)
7. withCallingHandlers(mask$eval_all_filter(dots, env_filter), error = dplyr_error_handler(dots = dots,
 . mask = mask, bullets = filter_bullets, error_call = error_call),
 . warning = function(cnd) {
 . local_error_context(dots, i, mask)
 . warning_handler(cnd)
 . }, `dplyr:::signal_filter_one_column_matrix` = function(e) {
 . warn_filter_one_column_matrix(call = error_call)
 . }, `dplyr:::signal_filter_across` = function(e) {
 . warn_filter_across(call = error_call)
 . }, `dplyr:::signal_filter_data_frame` = function(e) {
 . warn_filter_data_frame(call = error_call)
 . })
8. mask$eval_all_filter(dots, env_filter)
9. eval()
10. .handleSimpleError(function (cnd)
 . {
 . local_error_context(dots, i = frame[[i_sym]], mask = mask)
 . if (inherits(cnd, "dplyr:::internal_error")) {
 . parent <- error_cnd(message = bullets(cnd))
 . }
 . else {
 . parent <- cnd
 . }
 . message <- c(cnd_bullet_header(action), i = if (has_active_group_context(mask)) cnd_bullet_cur_group_label())
 . abort(message, class = error_class, parent = parent, call = error_call)
 . }, "argument \"infDate\" is missing, with no default", base::quote(eval()))
11. h(simpleError(msg, call))
12. abort(message, class = error_class, parent = parent, call = error_call)
13. signal_abort(cnd, .file)
As you can see, our user-defined function throws an error when we neglect to provide an argument for the infDate parameter. Let's update the save.facet.plot() function by setting the infDate parameter to a known date "190423". This could easily be something different like setting a logical parameter to default to TRUE or FALSE, which could change internal behaviours of the function itself.
# Description: This function, given a set of data in the embryos.df format, will produce
# a faceted series of box plots for a specific infection date, split by sporeStrain and doseLevel
# Input:
# data.df: a data frame with at least the following column names
# $infectionDate, $wormStrain, $sporeStrain, $doseLevel, $embryos
# infDate: a character string used to subset the data (default: "190423")
# Output: save.facet.plot will generate a faceted plot from data.df based on the infDate variable
# The plot will be saved as a PNG file in the data/ directory
# It will return
# [1] subset data
# [2] ggplot object
# [3] save plot filename
### 3.3.0 Set the default value of infDate to 190423
save.facet.plot = function(data.df, infDate = "190423") {
# We've updated the plot to use a filter() function!
infection.data <- data.df %>% filter(infectionDate == infDate)
infectionPlot <-
ggplot(infection.data) +
# 2. Aesthetics
aes(x = wormStrain, y = embryos, fill = wormStrain) +
theme_grey() +
ggtitle(paste0("Embryo counts on infection date: ", infDate)) + # plot title
# 4. Geoms
geom_boxplot() +
# 6. Facets
facet_grid(sporeStrain ~ doseLevel)
# Save the name of the plot file
save.file = paste(infDate, "graph.facet.function.png", sep = "_")
# Save each plot as it's generated, reusing save.file so the saved file matches the returned name
ggsave(plot = infectionPlot, filename = save.file, path = "data/" ,
scale=2, device = "png", width = 20, height = 10, units = c("cm"))
#return the file name and data subset
# Create a list so that you can send multiple objects back as a single object
return(list(infection.data, infectionPlot, save.file))
}
# Rerun our function without an infection date
save.facet.plot(embryos.df)
[[1]]
# A tibble: 3,643 x 19
   experiment          wormNumber infectionDate wormStrain sporeStrain sporeDose
   <chr>                    <dbl> <fct>         <fct>      <fct>       <fct>
 1 190426_VC20019_LUA~          1 190423        VC20019    LUAm1       0
 2 190426_VC20019_LUA~          1 190423        VC20019    LUAm1       10
 3 190426_VC20019_LUA~          1 190423        VC20019    LUAm1       20
 4 190426_N2_LUAm1_0M~          1 190423        N2         LUAm1       0
 5 190426_N2_LUAm1_10~          1 190423        N2         LUAm1       10
 6 190426_N2_LUAm1_20~          1 190423        N2         LUAm1       20
 7 190426_AB1_LUAm1_0~          1 190423        AB1        LUAm1       0
 8 190426_AB1_LUAm1_1~          1 190423        AB1        LUAm1       10
 9 190426_AB1_LUAm1_2~          1 190423        AB1        LUAm1       20
10 190426_JU397_LUAm1~          1 190423        JU397      LUAm1       0
# i 3,633 more rows
# i 13 more variables: sporesM_cm2 <dbl>, doseLevel <fct>, spores <lgl>,
#   meronts <lgl>, embryos <dbl>, expTimepoint <dbl>, infectionType <fct>,
#   totalWorms <dbl>, plateSize <dbl>, fixingDate <fct>, stainingDate <fct>,
#   slideDate <fct>, imagingDate <fct>

[[2]]

[[3]]
[1] "190423_graph.facet.function.png"
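The same idea works for a logical parameter that toggles behaviour inside the function, as mentioned earlier in this section. The sketch below uses a hypothetical function, describe_counts(), with two defaulted parameters (removeNA and label) and invented counts. Any default can be overridden by naming the argument in the call.

```r
describe_counts <- function(counts, removeNA = TRUE, label = "embryos") {
  # removeNA toggles whether missing values are dropped before summing
  total <- sum(counts, na.rm = removeNA)
  paste("Total", label, "counted:", total)
}

counts <- c(18, 7, NA, 10)

usingDefaults <- describe_counts(counts)                    # both defaults apply
keepingNA     <- describe_counts(counts, removeNA = FALSE)  # override the logical toggle
newLabel      <- describe_counts(counts, label = "meronts") # override only the label

usingDefaults
keepingNA
newLabel
```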
3.4.0 User-defined functions can also define functions¶
While a rarer occurrence, your user-defined functions can be used to create and return a function themselves. In these cases, the scoping of your variables can become a little trickier but variables within the returned function can be set using parameters from the initial function.
Let's start with a simple example before we return to our plot-saving function.
# Define our function(s)
make.power = function(power) { # This sets the variable values (via lexical scoping) of the exponent
pow = function(base) { # When we call on the resulting function it will require a base value
base^power # Make the actual calculation
}
}
# Define a new function that does cubic calculations
cube = make.power(3)
# Now we have a function cube() that takes a parameter called base to calculate base^power
# Call on our cubic function using a base of 4
cube(4)
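To see that each generated function really does hold on to its own value of power, here is a minimal sketch. The name square and the use of force() and environment() are additions for illustration and are not part of the lecture code; force() simply makes sure the argument is evaluated when the factory is called.

```r
# A function factory: each call returns a new function that remembers `power`
make.power = function(power) {
  force(power)              # evaluate the argument now rather than lazily
  function(base) {
    base^power
  }
}

square = make.power(2)      # hypothetical helper: power is fixed at 2
cube = make.power(3)        # power is fixed at 3

square(4)                   # 16
cube(4)                     # 64

# Each generated function keeps its own enclosing environment holding `power`
environment(square)$power   # 2
environment(cube)$power     # 3
```

Because square() and cube() each carry their own environment, changing one never affects the other.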
Now let's revisit our plot-saving function. We'll make a new plot-setting function, set.facet.plot(), that we can use to permanently set the data frame that is used when making plots. We can initialize this newly set function and save it as the function make.embryo.plot().
# Description: This function, given a set of data in the embryos.df format, will produce
# a faceted series of box plots for a specific infection date, splitting by sporeStrain and doseLevel
# Input:
# data.df: a data frame at least with the following column names
# $infectionDate, $wormStrain, $sporeStrain, $doseLevel
# infDate: a character string used to subset the data
# Output: the returned function will generate a facet plot from data.df based on the infDate variable
# The plot will be saved to a file ending in "graph.facet.function.png"
# It will return
# [1] subset data
# [2] ggplot object
# [3] save plot filename
### 3.4.0 Define a new function where we set the data.df parameter as input.
set.facet.plot = function(data.df) {
# Set the default value of infDate to 190423
save.facet.plot = function(infDate = "190423") {
# We've updated the plot to use a filter() function!
infection.data <- data.df %>% filter(infectionDate == infDate)
infectionPlot <-
ggplot(infection.data) +
# 2. Aesthetics
aes(x = wormStrain, y = embryos, fill = wormStrain) +
theme_grey() +
ggtitle(paste0("Embryo counts on infection date: ", infDate)) + # plot title
# 4. Geoms
geom_boxplot() +
# 6. Facets
facet_grid(sporeStrain ~ doseLevel)
# Save the name of the plot file
save.file = paste(infDate, "graph.facet.function.png", sep = "_")
# Save each plot as it's generated, using the same file name that we return
ggsave(plot = infectionPlot, filename = save.file, path = "data/" ,
scale=2, device = "png", width = 20, height = 10, units = c("cm"))
#return the file name and data subset
return(list(infection.data, infectionPlot, save.file))
}
}
# Step 2. Make a function where the data set is embryos.df
make.embryo.plot <- set.facet.plot(embryos.df)
# Make a plot and filter it by infection Date
infection.results <- make.embryo.plot("200704")
infection.results[[2]]
3.5.0 The stop() function exits a function with a message¶
Sometimes you might produce a function that could fail at a number of points for various reasons. While the R-kernel may simply produce a warning and proceed, you may wish to stop the function wherever it is rather than proceeding. Using the stop() function can help produce "controlled" error stopping points in your program. You can also include an optional message that will help to clarify why you have stopped the function.
First, however, let's produce a simple example of using the stop() function.
# Let's see what happens when we work with the log function
log10(1)
log10(0)
log10(-1)
Warning message in eval(expr, envir, enclos): "NaNs produced"
Suppose we aren't interested in producing -Inf or NaN values? We can build a wrapper around the log10 function with some conditional branching inside it.
get.log10 = function(x) {
if(x <= 0) stop("Execution stopped: ", x, " is not acceptable input")
log10(x)
}
get.log10(1) # test our function
get.log10(-1) # Check it will stop when it's supposed to
get.log10(10) # Will this code run?
Error in get.log10(-1): Execution stopped: -1 is not acceptable input
Traceback:
1. get.log10(-1)
2. stop("Execution stopped: ", x, " is not acceptable input") # at line 3 of file <text>
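Our get.log10() wrapper checks a single value. If you hand it a vector instead, the comparison x <= 0 produces one TRUE/FALSE per element, which if() cannot use directly. The sketch below is one possible variant, not part of the lecture code: the name get.log10.vec is hypothetical, and it uses any() to collapse the comparison before deciding whether to call stop().

```r
# Hypothetical variant of get.log10() that accepts a whole vector
get.log10.vec = function(x) {
  # any() collapses the element-wise comparison into a single TRUE/FALSE
  if (any(x <= 0)) {
    bad = x[x <= 0]
    stop("Execution stopped: ", paste(bad, collapse = ", "), " is not acceptable input")
  }
  log10(x)
}

get.log10.vec(c(1, 10, 100))    # 0 1 2
# get.log10.vec(c(1, 0, -1))    # would stop and report the values 0 and -1
```

The second call is commented out so that the cell can run to completion; uncomment it to see the controlled error message.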
3.6.0 Use tryCatch() to identify errors without stopping¶
In our above example, using stop() halts the execution of our code. Sometimes, instead, we may wish to note that an error has occurred but still proceed with the remainder of the code. In that case you can use the tryCatch() function, which takes on a somewhat complex structure.
The tryCatch() function can be used to run an expression (or lines of code) and if an error or warning is produced, it can catch the result without halting your program's execution. Additional message information can be produced in each case so that the user can be warned of potential issues. Using tryCatch() takes the form of:
func_name = function(input) {
out <- tryCatch({ ## This is where we try code that might fail
expression(s) },
warning = function(condition) {
## statements to execute upon warning
message("Optional consolidated warning message")
return() # optional return value
},
error = function(condition) {
## statements to execute upon error
message("Optional consolidated error message")
return() # optional return value
},
finally = {
## Code to complete regardless of an error
}
) ## End of tryCatch
return(out)
}
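To make the template above concrete, here is a minimal runnable sketch built around log10(). The function name safe.log10 and the choice to return NA are assumptions made for illustration; conditionMessage() is the base R way to pull the text out of the caught warning or error.

```r
# Hypothetical wrapper: returns NA instead of halting or warning
safe.log10 = function(x) {
  out <- tryCatch({
      log10(x)                             # the code that might fail
    },
    warning = function(condition) {
      message("Caught a warning: ", conditionMessage(condition))
      return(NA)                           # this value is handed back to `out`
    },
    error = function(condition) {
      message("Caught an error: ", conditionMessage(condition))
      return(NA)
    },
    finally = {
      message("Finished checking input")   # runs in every case
    }
  ) # End of tryCatch
  return(out)
}

safe.log10(100)    # 2
safe.log10(-1)     # the NaN warning is caught, returns NA
safe.log10("a")    # the non-numeric error is caught, returns NA
```

Notice that all three calls run to completion, so any code placed after them would still execute.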
3.6.1 Remember that your functions should do one thing well¶
Let's focus again on the plotting functions we produced. Previously our versions of save.facet.plot() included steps where the input was being filtered, sometimes by sub-functions that should just be producing a plot object. To remedy this we'll go back to our rule of "Do One Thing" and we'll generate make.facet.plot() so that its sole purpose is to produce a plot when given a filtered dataset infection.data and a specific infection date infDate.
# Simplify our main function so that it takes in pre-filtered data and plots it
# @Input
# infection.data: a filtered set of infection data that represents a single replicate date
# infDate: the actual replicate date that will be used in the title information
# Note: infDate has no default value in this version
make.facet.plot = function(infection.data, infDate) {
infectionPlot <-
ggplot(infection.data) +
# 2. Aesthetics
aes(x = wormStrain, y = embryos, fill = wormStrain) +
theme_grey() +
ggtitle(paste0("Embryo counts on infection date: ", infDate)) + # plot title
# 4. Geoms
geom_boxplot() +
# 6. Facets
facet_grid(sporeStrain ~ doseLevel)
return(infectionPlot)
}
3.6.2 Call on subfunctions within your function to simplify debugging¶
One of the things you can do as your functions and needs become more complex is to nest functions within other functions. We've already applied this when we call ggplot() functions within save.facet.plot().
Next we want to generate a second function that will be able to filter a set like embryos.df, call on make.facet.plot(), and save the results as needed. In doing so we simplify the debugging process and it will help when we begin to incorporate a tryCatch() structure into our code.
# Make a function to filter data, make the plot, then save the plot
# @Input
# data.df: A dataset containing individual observations of embryo counts with at least:
# $wormStrain, $embryos, $infectionDate
# infDate: The infection date to filter data.df
#
# @Output
# List of 3 objects: 1) Filtered infection data
# 2) A ggplot object of the infection data
# 3) file name of saved plot
save.facet.plot = function(data.df, infDate) {
# filter the data
infection.data <- filter(data.df, infectionDate == infDate)
# make the plotted data
infection.plot <- make.facet.plot(infection.data, infDate)
# generate a save file name
save.file = paste(infDate, "graph.facet.saveFunction2.png", sep = "_")
# save the plot
ggsave(plot = infection.plot, filename = save.file, path = "data/" ,
scale=2, device = "png", units = c("cm"))
#return the file name and data subset
return(list(infection.data, infection.plot, save.file))
}
# save a facet plot but look at the output of the plot
save.facet.plot(embryos.df, "200704")[[2]]
Saving 33.9 x 33.9 cm image
3.6.3 Test the boundary cases of your function¶
Here's where we need to get creative. What would happen inside save.facet.plot() if we forgot to supply an infDate parameter in our call? Previously we included a default value like "190423" but we have not done so here. Using a call like save.facet.plot(embryos.df) will produce an error.
# save a facet plot but don't provide an infection date
save.facet.plot(embryos.df)[[2]]
Error in `filter()`: i In argument: `infectionDate == infDate`. Caused by error: ! argument "infDate" is missing, with no default Traceback: 1. save.facet.plot(embryos.df) 2. filter(data.df, infectionDate == infDate) # at line 15 of file <text> 3. filter.data.frame(data.df, infectionDate == infDate) 4. filter_rows(.data, dots, by) 5. filter_eval(dots, mask = mask, error_call = error_call) 6. withCallingHandlers(mask$eval_all_filter(dots, env_filter), error = dplyr_error_handler(dots = dots, . mask = mask, bullets = filter_bullets, error_call = error_call), . warning = function(cnd) { . local_error_context(dots, i, mask) . warning_handler(cnd) . }, `dplyr:::signal_filter_one_column_matrix` = function(e) { . warn_filter_one_column_matrix(call = error_call) . }, `dplyr:::signal_filter_across` = function(e) { . warn_filter_across(call = error_call) . }, `dplyr:::signal_filter_data_frame` = function(e) { . warn_filter_data_frame(call = error_call) . }) 7. mask$eval_all_filter(dots, env_filter) 8. eval() 9. .handleSimpleError(function (cnd) . { . local_error_context(dots, i = frame[[i_sym]], mask = mask) . if (inherits(cnd, "dplyr:::internal_error")) { . parent <- error_cnd(message = bullets(cnd)) . } . else { . parent <- cnd . } . message <- c(cnd_bullet_header(action), i = if (has_active_group_context(mask)) cnd_bullet_cur_group_label()) . abort(message, class = error_class, parent = parent, call = error_call) . }, "argument \"infDate\" is missing, with no default", base::quote(eval())) 10. h(simpleError(msg, call)) 11. abort(message, class = error_class, parent = parent, call = error_call) 12. signal_abort(cnd, .file)
3.6.4 Implement a tryCatch() series to try and capture your error¶
Instead of allowing the execution to halt when we reach an error maybe we can produce some messages and return a null value? In this implementation we will return a NULL value for the user to deal with.
# Make a function to filter data, make the plot, then save the plot
save.facet.plot = function(data.df, infDate) {
# Initiate a tryCatch and assign its output to a variable
out <- tryCatch({
# filter the data
infection.data <- filter(data.df, infectionDate == infDate)
# What happens if we pick a non-existent infection date or if the parameter is not set?
},
error = function(c) { # The actual error information is passed in as the variable "c"
# Assume the error occurs when no infDate is provided
message("Error: potentially missing parameter information")
return(NULL)
# return NULL so we can recognize an error has occurred
# You're kind of a function inside a function.
# when you return, you'll leave to the next section.
}) # End tryCatch
if (!is.null(out)) {
# make the plotted data
infection.plot <- make.facet.plot(infection.data, infDate)
# generate a save file name
save.file = paste(infDate, "graph.facet.saveFunction.tryCatch.png", sep = "_")
# save the plot
ggsave(plot = infection.plot, filename = save.file, path = "data/" ,
scale=2, device = "png", units = c("cm"))
#return the file name and data subset
return(list(infection.data, infection.plot, save.file))
}
else return(out) # if it's an error we'll get the NULL
}
# save a facet plot but look at the output of the plot
save.facet.plot(embryos.df)[[2]]
Error: potentially missing parameter information
NULL
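Returning NULL only helps if the caller checks for it. The sketch below uses a small stand-in function (get.subset and its arguments are hypothetical, not part of the lecture code) to show one way the calling code might test the result with is.null() before using it.

```r
# Hypothetical stand-in for a function that returns NULL when something fails
get.subset = function(values, threshold) {
  tryCatch({
      values[values > threshold]     # errors if `threshold` was not supplied
    },
    error = function(c) {
      message("Error: potentially missing parameter information")
      return(NULL)
    })
}

result = get.subset(c(1, 5, 10))     # threshold is missing, so we get NULL

# Decide what to do based on whether an error was signalled
if (is.null(result)) {
  message("Nothing to work with; check the arguments and try again")
} else {
  print(result)
}
```

A call such as get.subset(c(1, 5, 10), 4) takes the other branch and prints the values 5 and 10.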
3.6.5 Use tryCatch() to set values within your function¶
Suppose instead of just returning a NULL value when we produce an error, we can change values on the user's behalf and continue? Of course our example here is in the context of an expected error and we can't always account for the nature of the error(s) we'll encounter. You could make things more complex and try to program some statements to determine the error type!
In our example, we'll try to anticipate the issue of a missing infection date and "assume" that will be our only problem. We'll take advantage of the <<- scoping assignment operator. Rather than creating a new local variable as <- would, it searches up through the enclosing scopes until it finds an existing variable with that name and assigns the value there; if none is found, the variable is created in the global environment.
Let's modify the save.facet.plot() function so that our error handler can set the infDate variable within save.facet.plot().
# Make a function to filter data, make the plot, then save the plot
save.facet.plot = function(data.df, infDate) {
# Initiate a tryCatch and assign its output to a variable
out <- tryCatch({
# filter the data
infection.data <- filter(data.df, infectionDate == infDate)
# What happens if we pick a non-existent infection date or if the parameter is not set?
},
error = function(c) { # The actual error information is passed in as the variable "c"
# Assume the error occurs when no infDate is provided
message("Warning: No infection date provided")
message("substituting with a first-level value")
### 3.6.5 Remember: we are in a mini function at the moment
# We need to go up a level and set infDate within the save.facet.plot function
# Set it to the first level of $infectionDate
infDate <<- levels(data.df$infectionDate)[1]
# Then we'll remake the data subset
infection.data <<- filter(data.df, infectionDate == infDate)
# This will allow us to proceed with the rest of the code as though nothing were amiss
# return(NULL)
# return NULL so we can recognize an error has occurred
# You're kind of a function inside a function.
# when you return, you'll leave to the next section.
}) # End tryCatch
# make the plotted data
infection.plot <- make.facet.plot(infection.data, infDate)
# generate a save file name
save.file = paste(infDate, "graph.facet.saveFunction.tryCatchv2.png", sep = "_")
# save the plot
ggsave(plot = infection.plot, filename = save.file, path = "data/" ,
scale=2, device = "png", units = c("cm"))
#return the file name and data subset
return(list(infection.data, infection.plot, save.file))
}
# save a facet plot but look at the output of the plot
save.facet.plot(embryos.df)[[2]]
Warning: No infection date provided
substituting with a first-level value
Saving 33.9 x 33.9 cm image
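The same pattern can be tried on a much smaller scale. In this sketch, describe.value and its variables are hypothetical names chosen for illustration. One design choice worth noting: sentence is created before the tryCatch() call so that <<- finds it inside the function rather than creating a new variable in the global environment.

```r
# Hypothetical function: fall back to a default when `label` is missing
describe.value = function(value, label) {
  sentence <- NULL                        # exists here, so `<<-` will find it
  tryCatch({
      sentence <- paste(label, value)     # fails if `label` was not supplied
    },
    error = function(c) {
      # The handler is its own function, so `<-` would only make local copies.
      # `<<-` walks up to the enclosing function and updates the variables there.
      label <<- "default"
      sentence <<- paste(label, value)
    })
  return(list(label, sentence))
}

describe.value(3, "dose")   # "dose" and "dose 3"
describe.value(3)           # "default" and "default 3"
```

In both calls the code after the tryCatch() block sees usable values, which is what lets the function carry on as though nothing were amiss.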
Here's an alternative version of our code that runs all of the code within the tryCatch call using the finally option.
# Make a function to filter data, make the plot, then save the plot
save.facet.plot = function(data.df, infDate) {
# Initiate a tryCatch and assign its output to a variable
out <- tryCatch({
# filter the data
infection.data <- filter(data.df, infectionDate == infDate)
# What happens if we pick a non-existent infection date or if the parameter is not set?
},
error = function(c) { # The actual error information is passed in as the variable "c"
# Assume the error occurs when no infDate is provided
message("Warning: No infection date provided")
message("substituting with a first-level value")
# Remember: we are in a mini function at the moment
# We need to go up a level and set infDate within the save.facet.plot function
# Set it to the first level of $infectionDate
infDate <<- levels(data.df$infectionDate)[1]
# Then we'll remake the data subset
infection.data <<- filter(data.df, infectionDate == infDate) # use <<- so the subset is visible outside this handler
# This will allow us to proceed with the rest of the code as though nothing were amiss
# return(NULL)
# return NULL so we can recognize an error has occurred
# You're kind of a function inside a function.
# when you return, you'll leave to the next section.
},
### Set up a finally section which runs regardless of whether or not an error has been caught
# You could put all or some of your end-code here depending on your needs
# We'll move all the post-catch code into here for fun
finally = {
# make the plotted data
infection.plot <- make.facet.plot(infection.data, infDate)
# generate a save file name
save.file = paste(infDate, "graph.facet.saveFunction.tryCatchv2.png", sep = "_")
# save the plot
ggsave(plot = infection.plot, filename = save.file, path = "data/" ,
scale=2, device = "png", units = c("cm"))
#return the file name and data subset
return(list(infection.data, infection.plot, save.file))
}) # End tryCatch
}
# save a facet plot but look at the output of the plot
save.facet.plot(embryos.df)[[2]]
Warning: No infection date provided
substituting with a first-level value
Saving 33.9 x 33.9 cm image
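If the order of events inside tryCatch() feels hard to follow, a small tracer can help. The sketch below is only an illustration (run.steps is a hypothetical helper) that records which sections ran, with and without an error.

```r
# Hypothetical helper that records the order in which the sections run
run.steps = function(x) {
  steps = character(0)                     # a running record of what happened
  tryCatch({
      steps = c(steps, "try")
      log10(x)                             # errors for non-numeric input
    },
    error = function(c) {
      steps <<- c(steps, "error")          # update the record one level up
    },
    finally = {
      steps = c(steps, "finally")          # runs with or without an error
    })
  return(steps)
}

run.steps(10)     # "try" "finally"
run.steps("a")    # "try" "error" "finally"
```

The finally section runs last in both cases, after any handler has finished.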
3.6.6 Use flow control to handle error situations¶
So it looks like we've provided some leeway for the user in case they fail to provide any sort of infection date to subset our data. What if, however, they simply provide an incorrect date? Let's see what the result will be if we try to run our current version of save.facet.plot with an incorrect date.
save.facet.plot(embryos.df, "221019")[[2]]
Saving 33.9 x 33.9 cm image
Error in `combine_vars()`: ! Faceting variables must have at least one value Traceback: 1. save.facet.plot(embryos.df, "221019") 2. tryCatch({ . infection.data <- filter(data.df, infectionDate == infDate) . }, error = function(c) { . message("Warning: No infection date provided") . message("substituting with a first-level value") . infDate <<- levels(data.df$infectionDate)[1] . infection.data <- filter(data.df, infectionDate == infDate) . }, finally = { . infection.plot <- make.facet.plot(infection.data, infDate) . save.file = paste(infDate, "graph.facet.saveFunction.tryCatchv2.png", . sep = "_") . ggsave(plot = infection.plot, filename = save.file, path = "data/", . scale = 2, device = "png", units = c("cm")) . return(list(infection.data, infection.plot, save.file)) . }) # at line 5-47 of file <text> 3. ggsave(plot = infection.plot, filename = save.file, path = "data/", . scale = 2, device = "png", units = c("cm")) # at line 41-42 of file <text> 4. grid.draw(plot) 5. grid.draw.ggplot(plot) 6. print(x) 7. print.ggplot(x) 8. ggplot_build(x) 9. ggplot_build.ggplot(x) 10. layout$setup(data, plot$data, plot$plot_env) 11. setup(..., self = self) 12. self$facet$compute_layout(data, self$facet_params) 13. compute_layout(..., self = self) 14. combine_vars(data, params$plot_env, rows, drop = params$drop) 15. cli::cli_abort("Faceting variables must have at least one value") 16. rlang::abort(message, ..., call = call, use_cli_format = TRUE, . .frame = .frame) 17. signal_abort(cnd, .file)
As is the case above, we've accounted for a lack of input when we call on save.facet.plot but not for the situation where the input provided is incorrect! If we wanted to add another layer of protection, we'd have to include that, or add some flow control as below!
# Make a function to filter data, make the plot, then save the plot
save.facet.plot = function(data.df, infDate) {
# Initiate a tryCatch and assign its output to a variable
out <- tryCatch({
# filter the data
infection.data <- filter(data.df, infectionDate == infDate)
# What happens if we pick a non-existent infection date or if the parameter is not set?
},
error = function(c) { # The actual error information is passed in as the variable "c"
# Assume the error occurs when no infDate is provided
message("Error: No infection date provided")
message("substituting with a first-level value")
# Remember: we are in a mini function at the moment
# We need to go up a level and set infDate within the save.facet.plot function
infDate <<- levels(data.df$infectionDate)[1]
# Then we'll remake the data subset
infection.data <<- filter(data.df, infectionDate == infDate)
# This will allow us to proceed with the rest of the code as though nothing were amiss
}) # End the tryCatch having made a subset or default version if no infDate is supplied
# In the case of an INCORRECT infDate there are a couple of ways to go about doing it
# Check if our filtered data has any rows
if(nrow(infection.data) > 0) {
infection.plot <- make.facet.plot(infection.data, infDate)
# generate a save file name
save.file = paste(infDate, "graph.facet.saveFunction.tryCatchv2.png", sep = "_")
# save the plot
ggsave(plot = infection.plot, filename = save.file, path = "data/" ,
scale=2, device = "png", units = c("cm"))
#return the file name and data subset
return(list(infection.data, infection.plot, save.file))
} else {
# filter results in 0-row subset
return(rep("Warning: data filtering resulted in 0-row subset", 3))
}
}
# save.facet.plot(embryos.df)[[2]]
save.facet.plot(embryos.df, "221019")[[2]]
Hint: How do you check your unfiltered data for different factor levels? Which variable will you query?
# Comprehension Question 3.0.0
# Make a function to filter data, make the plot, then save the plot
save.facet.plot.updated = function(data.df, infDate) {
# Initiate a tryCatch and assign its output to a variable
out <- tryCatch({
# filter the data
infection.data <- filter(data.df, infectionDate == infDate)
# What happens if we pick a non-existent infection date or if the parameter is not set?
},
error = function(c) { # The actual error information is passed in as the variable "c"
# Assume the error occurs when no infDate is provided
message("Error: No infection date provided")
message("substituting with a first-level value")
# Remember: we are in a mini function at the moment
# We need to go up a level and set infDate within the save.facet.plot function
infDate <<- levels(data.df$infectionDate)[1]
# Then we'll remake the data subset
infection.data <<- filter(data.df, infectionDate == infDate)
# This will allow us to proceed with the rest of the code as though nothing were amiss
return(NULL)
# return NULL so we can recognize an error has occurred
# You're kind of a function inside a function.
# when you return, you'll leave to the next section.
}) # End the tryCatch having made a subset or default version if no infDate is supplied
# In the case of an INCORRECT infDate, how can we tell if it isn't a possible value?
if(...) {
infection.plot <- make.facet.plot(infection.data, infDate)
# generate a save file name
save.file = paste(infDate, "graph.facet.saveFunction.tryCatchv2.png", sep = "_")
# save the plot
ggsave(plot = infection.plot, filename = save.file, path = "data/" ,
scale=2, device = "png", units = c("cm"))
#return the file name and data subset
return(list(infection.data, infection.plot, save.file))
} else {
# filter results in 0-row subset
return(rep("Error: infDate provided is not a level of infectionDate", 3))
}
}
Now that you have the basics, you can continue to build on complexity (or simplicity) as you need it.
![]() |
|---|
| Call your functions from anywhere once they are stored in memory. |
4.0.0 Taking full advantage of your R environment¶
While working within the R environment we've learned to manipulate data and save its output as text or excel files. We've also learned to generate our own functions and save output as variables. When we create very useful functions and want to keep the code, there is no need to copy and paste it into every script we make.
In this last section we will discover how we can import our own functions, save data objects, and load R workspaces into memory.
4.1.0 Keep all of your helper functions and subroutines in a file as a source()¶
As a final extension of our control flow lesson, you already know about packages - these hold functions and data that are pre-made by others within the R community. You normally install these with install.packages() and then load them into memory with library().
You don't need to make your own packages to get similar capabilities using your customized functions. Instead, you can certainly make source files to keep functions and pertinent variables you may re-use in all of your analyses.
To access a saved "R" file which contains purely code (and comments!), you can use the source() command. Let's try!
#?source
# Load data and information from another R script
source("./data/Lecture07.R")
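If you'd like to try source() without relying on a file in the data folder, the sketch below writes a tiny script to a temporary file and then loads it. The function c.to.f and the temporary file are hypothetical stand-ins for your own helper script.

```r
# Write a tiny helper script to a temporary file
# (a stand-in for a real file of your own helper functions)
helper.file = tempfile(fileext = ".R")
writeLines(c(
  "# Convert Celsius to Fahrenheit",
  "c.to.f = function(celsius) {",
  "  celsius * 9 / 5 + 32",
  "}"
), helper.file)

# source() runs every line of the file, so c.to.f() now exists in memory
source(helper.file)
c.to.f(100)    # 212
```

Anything defined in the sourced file, whether functions or variables, becomes available exactly as if you had typed it into a code cell.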
4.2.0 Query your environment with ls() to find variables and functions¶
After loading your script into memory, you may want to see what is available in your environment's memory. The ls() command allows you to see what is available but it does not discriminate between objects or functions.
# See what variables and functions you have in memory
print(ls())
 [1] "a"                 "anything"          "codon_translation"
 [4] "codonToAA"         "cube"              "days"
 [7] "dexType"           "embryos.df"        "get.log10"
[10] "goodbye.class"     "grade"             "i"
[13] "inf200704.plot"    "infection.data"    "infection.results"
[16] "infectionRep"      "letterGrade"       "make.embryo.plot"
[19] "make.facet.plot"   "make.power"        "n"
[22] "numVector"         "plot"              "pokemon"
[25] "programmer"        "result"            "save.facet.plot"
[28] "set.facet.plot"    "subdata"           "subdata_ttest"
[31] "surprise.class"    "t"                 "tally"
[34] "variable"          "x"                 "y"
[37] "z"
4.2.1 Check functions loaded into memory with lsf.str()¶
As you can see from above, using ls() returns all the objects currently saved in memory but also the functions we've previously declared and possibly some new ones imported from our call to source(). To see which functions we have loaded outside of those from packages in memory, we can use lsf.str(). Let's see what's new and try something out.
# To see which functions are available in memory
lsf.str()
codonToAA : function (codon)
cube : function (base)
get.log10 : function (x)
goodbye.class : function ()
make.embryo.plot : function (infDate = "190423")
make.facet.plot : function (infection.data, infDate)
make.power : function (power)
save.facet.plot : function (data.df, infDate)
set.facet.plot : function (data.df)
surprise.class : function (name)
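To compare ls() and lsf.str() without depending on what happens to be in memory, you can point both at a small scratch environment. Everything in this sketch (scratch, doses, double.it) is a hypothetical name used only for illustration.

```r
# Build a small scratch environment holding one vector and one function
scratch = new.env()
assign("doses", c(0, 10, 20), envir = scratch)
assign("double.it", function(x) x * 2, envir = scratch)

ls(envir = scratch)         # "doses" "double.it": objects and functions together
lsf.str(envir = scratch)    # lists only the function double.it

# Keep only the names that are NOT functions
all.names = ls(envir = scratch)
is.fun = sapply(all.names, function(nm) is.function(get(nm, envir = scratch)))
all.names[!is.fun]          # "doses"
```

The last three lines are one way to get the complement of lsf.str(), that is, the data objects without the functions.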
# Let's look at a new function from "Lecture07.R"
codonToAA
function (codon)
{
return(str_replace_all(codon, codon_translation))
}
# Look up newly added variables
codon_translation
# Use codonToAA on a single codon
codonToAA("AUA")
# Use codonToAA on multiple codons
codonToAA(c("AUA", "UAG")) %>% str_flatten()
UUU: 'F'   UCU: 'S'   UAU: 'Y'   UGU: 'C'
UUC: 'F'   UCC: 'S'   UAC: 'Y'   UGC: 'C'
UUA: 'L'   UCA: 'S'   UAA: '*'   UGA: '*'
UUG: 'L'   UCG: 'S'   UAG: '*'   UGG: 'W'
CUU: 'L'   CCU: 'P'   CAU: 'H'   CGU: 'R'
CUC: 'L'   CCC: 'P'   CAC: 'H'   CGC: 'R'
CUA: 'L'   CCA: 'P'   CAA: 'Q'   CGA: 'R'
CUG: 'L'   CCG: 'P'   CAG: 'Q'   CGG: 'R'
AUU: 'I'   ACU: 'T'   AAU: 'N'   AGU: 'S'
AUC: 'I'   ACC: 'T'   AAC: 'N'   AGC: 'S'
AUA: 'I'   ACA: 'T'   AAA: 'K'   AGA: 'R'
AUG: 'M'   ACG: 'T'   AAG: 'K'   AGG: 'R'
GUU: 'V'   GCU: 'A'   GAU: 'D'   GGU: 'G'
GUC: 'V'   GCC: 'A'   GAC: 'D'   GGC: 'G'
GUA: 'V'   GCA: 'A'   GAA: 'E'   GGA: 'G'
GUG: 'V'   GCG: 'A'   GAG: 'E'   GGG: 'G'
# Let's try surprise.class()
surprise.class("class")
4.3.0 save() objects or your whole kernel memory!¶
From time to time you may have objects from analyses that aren't easily translated back as data tables or excel files. Perhaps you may want to save objects or plots from a complex analysis for later use. You can accomplish this with the save() command by providing a list of one or more objects to save.
print(ls())
save(inf200704.plot, subdata, embryos.df,
file="./data/Lecture07.RData") # Note the filetype we use to save data is "RData"
 [1] "a"                 "anything"          "codon_translation"
 [4] "codonToAA"         "cube"              "days"
 [7] "dexType"           "embryos.df"        "get.log10"
[10] "goodbye.class"     "grade"             "i"
[13] "inf200704.plot"    "infection.data"    "infection.results"
[16] "infectionRep"      "letterGrade"       "make.embryo.plot"
[19] "make.facet.plot"   "make.power"        "n"
[22] "numVector"         "plot"              "pokemon"
[25] "programmer"        "result"            "save.facet.plot"
[28] "set.facet.plot"    "subdata"           "subdata_ttest"
[31] "surprise.class"    "t"                 "tally"
[34] "variable"          "x"                 "y"
[37] "z"
4.3.1 save.image() saves your entire workspace¶
Sometimes you just want to save everything in memory. This may be a safeguard against accidental errors after running long analyses. The same can be said about saving single objects but you may find this a useful command in the future.
# Save an image of everything to an RData file
save.image(file="./data/Lecture07.all.RData")
4.4.0 load() .RData files into memory¶
When you're finally ready to revisit your saved objects or memory, you'll want to restore them. It's as easy as using the command load(). Let's demonstrate, but first we need to clean up our current memory with rm().
# Clear memory
rm(list = ls())
# check that it's clear
print(ls())
character(0)
# reload it all
load("./data/Lecture07.all.RData")
print(ls())
 [1] "a"                 "anything"          "codon_translation"
 [4] "codonToAA"         "cube"              "days"
 [7] "dexType"           "embryos.df"        "get.log10"
[10] "goodbye.class"     "grade"             "i"
[13] "inf200704.plot"    "infection.data"    "infection.results"
[16] "infectionRep"      "letterGrade"       "make.embryo.plot"
[19] "make.facet.plot"   "make.power"        "n"
[22] "numVector"         "plot"              "pokemon"
[25] "programmer"        "result"            "save.facet.plot"
[28] "set.facet.plot"    "subdata"           "subdata_ttest"
[31] "surprise.class"    "t"                 "tally"
[34] "variable"          "x"                 "y"
[37] "z"
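The same round trip works for a hand-picked set of objects saved with save(). The sketch below uses hypothetical objects and a temporary file so that nothing in the data folder is touched.

```r
# Hypothetical objects and a temporary file for the demonstration
doses = c(0, 10, 20)
strain = "N2"
tmp.file = tempfile(fileext = ".RData")

save(doses, strain, file = tmp.file)   # store just these two objects
rm(doses, strain)                      # remove them from memory

restored = load(tmp.file)              # load() returns the names it restored
restored                               # "doses" "strain"
doses                                  # 0 10 20
```

Capturing the return value of load() is a handy way to find out what an unfamiliar .RData file just added to your environment.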
Let's review our time together. Over the span of this course we've discussed
- Basic data types, objects and classes in R.
- Data manipulation with the dplyr package.
- Principles of tidy data using the tidyverse package.
- The grammar of graphics with the ggplot2 package.
- Regular expressions and string manipulation with stringr.
- Regression and data analysis with general linear methods.
- Control flow through looping and functions.
You now have the tools to accomplish quite a few tasks and the foundation to grow your skills as needed. Let's run a final function together to celebrate!
# Time to run our final function together
...
5.1.0 Post-course survey¶
There is no post-lecture assessment this week. Your DataCamp accounts will continue to remain active for another ~4 months during which time you can choose to explore the site's different courses. Please take advantage of this opportunity to keep growing your R skills!
However, we have created a post-course survey you can fill out anonymously. You can use this survey as an opportunity to tell us about your experience and help shape the future offerings of this series. Please take 5-10 minutes to fill out the survey. We really appreciate your feedback!
Anonymous Google Survey found here
![]() |
|---|
| Don't forget to submit your term project! |
5.2.0 Submit your completed skeleton notebook (2% of final grade)¶
At the end of this lecture a Quercus assignment portal will be available to submit your completed skeletons from today (including the comprehension question answers!). These will be due one week later, before the next lecture. Each lecture skeleton is worth 2% of your final grade but a bonus 0.5% will also be awarded for submissions made within 24 hours from the end of lecture (ie 1600 hours the following day). To print your notebook:
- From the Jupyter Notebook, select File > Print Preview
- Use the resulting webpage to print/save as a PDF through your browser's Print menu.
Note that printing your notebook directly without using this method can result in cut-off coding cell text!
5.3.0 Final assignment guidelines (50% of final grade)¶
Your final project will be due two weeks after this lecture at 23:59 hours on Wednesday November 8th. Please submit your final assignment as a single compressed file which will include:
- Your Jupyter Notebook final project
- A PDF version of the notebook with all output from the code cell. This will be used for markup and comments that I can return to you about your projects.
- Any associated data needed to run your project. When you create your compressed file for submission, you can preserve the folder structure by compressing the entire folder with the needed files.
Please refer to the marking rubric found in this course's root directory on JupyterHub for additional instructions.
You can build your Jupyter Notebooks on the UofT JupyterHub and save/download the files to your personal computer for compressing before submitting on Quercus.
![]() |
|---|
| Any additional questions can be emailed to me or the TAs or posted to the Discussion section of Quercus. Best of luck! |
5.3.0 Acknowledgements¶
Revision 1.0.0: materials prepared in R Markdown by Oscar Montoya, M.Sc. Bioinformatician, Education and Outreach, CAGEF.
Revision 1.1.0: edited and prepared for CSB1020H F LEC0142, 09-2021 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
Revision 1.1.1: edited and prepared for CSB1020H F LEC0142, 09-2022 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
Revision 1.1.2: edited and prepared for CSB1020H F LEC0142, 09-2023 by Calvin Mok, Ph.D. Bioinformatician, Education and Outreach, CAGEF.
5.4.0 Your DataCamp academic subscription¶
This class is supported by DataCamp, the most intuitive learning platform for data science and analytics. Learn any time, anywhere and become an expert in R, Python, SQL, and more. DataCamp’s learn-by-doing methodology combines short expert videos and hands-on-the-keyboard exercises to help learners retain knowledge. DataCamp offers 350+ courses by expert instructors on topics such as importing data, data visualization, and machine learning. They’re constantly expanding their curriculum to keep up with the latest technology trends and to provide the best learning experience for all skill levels. Join over 6 million learners around the world and close your skills gap.
Your DataCamp academic subscription grants you free access to the DataCamp's catalog for 6 months from the beginning of this course. You are free to look for additional tutorials and courses to help grow your skills for your data science journey. Learn more (literally!) at DataCamp.com.

5.5.0 Resources¶
- A primer on flow control https://adv-r.hadley.nz/control-flow.html
- Conditional statements and for loops https://resbaz.github.io/2014-r-materials/lessons/30-control-flow/
- More on control structures https://bookdown.org/rdpeng/rprogdatascience/control-structures.html







